<a href="https://colab.research.google.com/github/mukeshrock7897/Data-Analysis/blob/main/pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **What is Pandas?**
* Pandas is a Python library used for working with data sets.
* It has functions for cleaning, analyzing, exploring, and manipulating data.
* The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
* Pandas was created by Wes McKinney in 2008.

### **Why Use Pandas?**

**1. Efficient Data Structures:**

- **Series:** A one-dimensional labeled array capable of holding any data type (integers, floats, strings, objects).
- **DataFrame:** A two-dimensional labeled data structure with columns that can hold different data types, similar to a spreadsheet.
- **Index:** The set of labels for the rows or columns, allowing efficient label-based access and alignment. Labels are usually unique, but duplicates are allowed.

**2. Data Manipulation:**

- **Selection and Filtering:** Easily select specific rows, columns, or subsets of data based on conditions.
- **Aggregation:** Calculate summary statistics like mean, median, standard deviation, and more.
- **Grouping:** Group data by specific columns and perform calculations on each group.
- **Joining and Merging:** Combine data from multiple DataFrames based on shared columns or indexes.
- **Reshaping:** Transform data into different formats, such as pivoting or stacking.

**3. Data Cleaning and Preparation:**

- **Handling Missing Values:** Fill missing values, drop rows or columns with missing data, or interpolate values.
- **Data Formatting:** Convert data types, normalize data, and handle outliers.
- **Text Processing:** Extract information from text data using regular expressions and other techniques.

**4. Integration with Other Libraries:**

- **Seaborn:** Create visually appealing statistical plots.
- **Matplotlib:** Customize visualizations in more detail.
- **Scikit-learn:** Build machine learning models using Pandas-prepared data.
- **Statsmodels:** Perform statistical tests and modeling.

**5. Large Datasets:**

- **Efficient Handling:** Pandas stores columns in NumPy-backed arrays and uses vectorized operations, which makes it efficient for datasets that fit in memory.
- **Performance:** Leverage Pandas's optimized algorithms and data structures for faster data analysis.

**6. Readability and Maintainability:**

- **Clear Code:** Pandas's intuitive syntax and expressive functions make your code more readable and easier to understand.
- **Maintainability:** Well-structured Pandas code is easier to maintain and modify over time.


### **What Can Pandas Do?**
Pandas helps you answer questions about your data, such as:

* Is there a correlation between two or more columns?
* What is the average value?
* What is the maximum value?
* What is the minimum value?

**Note:** Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
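
The questions above can be sketched in a few lines. This is a minimal illustration on made-up data; the column names `temp_c` and `ice_cream_sales` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data; one row has a missing temperature.
weather = pd.DataFrame({
    "temp_c": [20.0, 25.0, None, 30.0],
    "ice_cream_sales": [200, 250, 180, 300],
})

# Cleaning: drop rows that contain missing values.
clean = weather.dropna()

avg_temp = clean["temp_c"].mean()   # average value
max_temp = clean["temp_c"].max()    # maximum value
min_temp = clean["temp_c"].min()    # minimum value
corr = clean["temp_c"].corr(clean["ice_cream_sales"])  # Pearson correlation

print(avg_temp, max_temp, min_temp, round(corr, 3))
```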



### **Install Pandas**
```python
!pip install pandas
```


### **Importing Pandas**
```python
import pandas as pd
```


### **Checking Pandas Version**
```python
import pandas as pd

print(pd.__version__)
```

---
---
---
### **Pandas Series**
* **Pandas Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).
* **Series** is like a column in a DataFrame or a more powerful version of a NumPy array.
* Each entry in a Series has a label (index), making it easier to access data.


---

### 1. **Creating a Pandas Series**

#### a. **From a List**
You can create a Series directly from a Python list.
```python
import pandas as pd

data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)
```
Output:
```
0    10
1    20
2    30
3    40
dtype: int64
```

#### b. **From a Dictionary**
You can create a Series from a Python dictionary, where the keys become the labels (index).
```python
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
print(s)
```
Output:
```
a    10
b    20
c    30
dtype: int64
```

#### c. **With Custom Index**
You can specify a custom index (labels) when creating a Series.
```python
data = [100, 200, 300]
s = pd.Series(data, index=['x', 'y', 'z'])
print(s)
```
Output:
```
x    100
y    200
z    300
dtype: int64
```

#### d. **From a Scalar Value**
If you provide a scalar value, the same value is repeated for each index.
```python
s = pd.Series(5, index=['a', 'b', 'c'])
print(s)
```
Output:
```
a    5
b    5
c    5
dtype: int64
```

---

### 2. **Accessing Data in a Series**

#### a. **Accessing by Label (`.loc[]`)**
Use `.loc[]` to access data using labels.
```python
s = pd.Series([100, 200, 300], index=['x', 'y', 'z'])
print(s.loc['y'])  # Output: 200
```

#### b. **Accessing by Position (`.iloc[]`)**
Use `.iloc[]` to access data by position.
```python
print(s.iloc[1])  # Output: 200 (second element of the Series defined above)
```

#### c. **Accessing by Boolean Mask**
You can filter a Series based on conditions using Boolean indexing.
```python
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s[s > 20])  # Output: c    30, d    40
```

---

### 3. **Modifying a Series**

#### a. **Modifying Values**
You can modify values in a Series by index label or position.
```python
s['b'] = 500     # update the value at label 'b'
s.iloc[0] = 100  # update the first element by position
print(s)
```

#### b. **Adding or Removing Data**
Series values can be changed in place: assigning to a new label adds an element, while `.drop()` returns a new Series without the given label.
```python
s['new'] = 600  # Adding a new element
print(s)

s = s.drop('new')  # Removing an element
print(s)
```

---

### 4. **Series Operations**

#### a. **Arithmetic Operations**
Operations on Series are performed element-wise, and labels are automatically aligned.

```python
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s1 + s2)
```
Output:
```
a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
```
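
If `NaN` for unmatched labels is not what you want, one option is the `Series.add()` method with `fill_value`, which treats a label missing from one side as the given value. A minimal sketch reusing the two Series above (the name `total` is only illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Treat a label missing from one side as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```

The result is still `float64`, but `'a'` and `'d'` now carry the value from whichever Series contained them.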

#### b. **Mathematical Functions**
You can apply mathematical functions like `sum()`, `mean()`, and `std()` to a Series.
```python
s = pd.Series([1, 2, 3, 4, 5])
print(s.sum())    # Output: 15
print(s.mean())   # Output: 3.0
print(s.std())    # Output: 1.58 (Standard Deviation)
```

#### c. **Vectorized Operations**
Operations like addition, subtraction, multiplication, etc., can be done directly on a Series.
```python
s = pd.Series([1, 2, 3])
print(s * 10)  # Each element will be multiplied by 10
```

---

### 5. **Series Indexing**

#### a. **Setting a Custom Index**
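You can set a custom index with the `index` parameter when creating a Series, or later by assigning to `s.index`. (`.set_index()` is a DataFrame method and is not available on a Series.)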
```python
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s)
```

#### b. **Resetting Index**
You can reset the index to default (0, 1, 2…) using `.reset_index()`.
```python
s_reset = s.reset_index(drop=True)
print(s_reset)
```

#### c. **Checking for Index Existence**
You can check if a label exists in the index using `in`.
```python
print('y' in s)  # Output: True
```

#### d. **Reindexing a Series**
You can reindex a Series to add or remove labels using `.reindex()`.
```python
new_index = ['y', 'z', 'w']  # drops 'x', keeps 'y' and 'z', adds 'w'
s_reindexed = s.reindex(new_index)
print(s_reindexed)
```
Output:
```
y    20.0
z    30.0
w     NaN
dtype: float64
```

---

### 6. **Handling Missing Data in Series**

#### a. **Handling `NaN` (Not a Number) Values**
You can fill or drop missing values in a Series.
- **Filling Missing Values**: `.fillna()`
```python
s_filled = s_reindexed.fillna(0)
print(s_filled)
```

- **Dropping Missing Values**: `.dropna()`
```python
s_dropped = s_reindexed.dropna()
print(s_dropped)
```
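
Beyond a constant fill, it often helps to count the missing entries first and then choose a strategy. The sketch below is self-contained and uses an illustrative Series named `s_missing`, which is not part of the original notebook.

```python
import numpy as np
import pandas as pd

s_missing = pd.Series([10.0, np.nan, 30.0, np.nan], index=['a', 'b', 'c', 'd'])

n_missing = s_missing.isna().sum()                # number of missing entries
forward_filled = s_missing.ffill()                # carry the last valid value forward
mean_filled = s_missing.fillna(s_missing.mean())  # fill with the mean of the valid values

print(n_missing)
print(forward_filled)
print(mean_filled)
```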

---

### 7. **Combining Multiple Series**

#### a. **Concatenating Series**
You can concatenate multiple Series using `pd.concat()`.
```python
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
combined = pd.concat([s1, s2])
print(combined)
```

#### b. **Appending Series**
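`Series.append()` was deprecated in pandas 1.4 and removed in pandas 2.0. Use `pd.concat()` instead; `ignore_index=True` renumbers the result from 0.
```python
s_combined = pd.concat([s1, s2], ignore_index=True)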
print(s_combined)
```

---

### 8. **Statistical Functions on Series**

Pandas Series has several useful statistical methods:
- **Describe**: Provides summary statistics for a Series.
```python
s = pd.Series([1, 2, 3, 4, 5])
print(s.describe())
```

- **Count, Min, Max**:
```python
print(s.count())  # Output: 5
print(s.min())    # Output: 1
print(s.max())    # Output: 5
```

- **Correlation and Covariance**:
```python
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
print(s1.corr(s2))  # Output: 1.0 (perfect correlation)
print(s1.cov(s2))   # Output: 1.0 (sample covariance)
```

---

### 9. **Using `apply()` to Apply Functions on Series**
The `apply()` function allows applying custom functions element-wise.

Example with a lambda function:
```python
s = pd.Series([1, 2, 3])
print(s.apply(lambda x: x**2))
```
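
For simple arithmetic, a vectorized expression gives the same result as `apply()` without calling a Python function per element, and `.map()` can translate values through a dictionary. A small sketch (the names `squared_apply`, `squared_vec`, and `names` are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

squared_apply = s.apply(lambda x: x ** 2)  # calls the lambda once per element
squared_vec = s ** 2                       # vectorized equivalent

# map() with a dict: values without a matching key become NaN
names = s.map({1: 'one', 2: 'two'})

print(squared_apply.equals(squared_vec))
print(names)
```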

---

### 10. **Accessing Data in Series as Arrays**

#### a. **Using `.values` Attribute**
You can access the underlying NumPy array using `.values` (the pandas documentation recommends `.to_numpy()` for new code).
```python
s = pd.Series([1, 2, 3])
print(s.values)  # Output: [1 2 3]
```

#### b. **Accessing the Index**
You can access the index (labels) of a Series using `.index`.
```python
print(s.index)
```

---

### 11. **Label Alignment and Broadcasting**
Operations between Series automatically align labels. If the labels do not match, Pandas fills with `NaN` for missing data.

```python
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s1 + s2)
```
Output:
```
a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
```

---

### Summary of Pandas Series Functionalities:

- **Creation**: From lists, dictionaries, scalars, with custom indices.
- **Accessing Data**: Using labels (`.loc[]`), positions (`.iloc[]`), and boolean masks.
- **Modifying Data**: Adding, updating, and dropping elements.
- **Mathematical Operations**: Element-wise operations, mathematical functions.
- **Handling Missing Data**: `.fillna()`, `.dropna()`.
- **Indexing**: Custom index, resetting index, reindexing.
- **Combining Series**: Concatenation, appending.
- **Statistical Functions**: `describe()`, `mean()`, `corr()`.
- **Applying Functions**: Using `.apply()` for custom functions.

---
---
---


### **Pandas Labels**
* Labels are the identifiers attached to each element of a Series or to each row and column of a DataFrame; they are stored in an `Index` object.
* They provide a way to access and manipulate data based on meaningful names or categories rather than just integer positions.


**Key Characteristics:**

* **Usually unique:** Unique labels make lookups unambiguous. Pandas does allow duplicate labels, but some operations (such as `.reindex()`) raise an error when duplicates are present.
* **Immutable:** An `Index` object cannot be modified element by element; to change labels, assign a new index or use `.rename()`.
* **Data Type:** Labels can be of any data type (e.g., strings, integers, objects).
* **Indexing:** Labels are used for indexing and selection of data.
* **Alignment:** Labels are used for aligning Series and DataFrames during operations.


**Types of Labels:**

* **Integer labels:** Numeric indices used for traditional array-style access.
* **String labels:** Descriptive names or categories assigned to elements.
* **Datetime labels:** Timestamps used for time series data.
* **Custom labels:** Any hashable object that can be used as an identifier.

**Creating Labels:**

* **Automatic labeling:** When creating a Series or DataFrame from a list or dictionary, labels are automatically generated based on the index or keys.

* **Explicit labeling:** You can explicitly assign labels using the `index` attribute.

---

### 1. **Accessing Data Using Labels**

#### a. **Using `.loc[]` for Label-Based Indexing**

**Definition:**  
`.loc[]` is used to access a group of rows and columns by labels (index or column names).

**Syntax:**  
```python
DataFrame.loc[row_labels, column_labels]
```

**Example:**

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}, index=['A', 'B', 'C'])

# Access data by label
print(df.loc['A', 'Name'])  # Output: 'Alice'
print(df.loc['B', :])       # Access all columns for row 'B'
```

**Output:**
```
Alice
Name              Bob
Age                30
City    San Francisco
Name: B, dtype: object
```
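
`.loc[]` also accepts lists of labels, label slices, and Boolean masks. The sketch below rebuilds the same sample data under the illustrative name `people` so that it runs on its own; note that label slices include both endpoints.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}, index=['A', 'B', 'C'])

subset = people.loc[['A', 'C'], ['Name', 'Age']]  # lists of row and column labels
sliced = people.loc['A':'B', 'Name':'Age']        # label slices include both endpoints
older = people.loc[people['Age'] > 25, 'Name']    # Boolean mask for rows, one column

print(subset)
print(sliced)
print(older)
```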

---

#### b. **Using `.iloc[]` for Integer-Based Indexing**

**Definition:**  
`.iloc[]` is used to access data by integer-based position (similar to NumPy).

**Syntax:**  
```python
DataFrame.iloc[row_index, column_index]
```

**Example:**

```python
# Access data by integer position
print(df.iloc[0, 0])  # Output: 'Alice' (First row, first column)
print(df.iloc[1, :])  # All columns for the second row (Bob)
```

**Output:**
```
Alice
Name              Bob
Age                30
City    San Francisco
Name: B, dtype: object
```

---

### 2. **Modifying Labels**

#### a. **Renaming Labels**

**Definition:**  
The `.rename()` function is used to rename index labels or column names.

**Syntax:**  
```python
DataFrame.rename(index={'old_label': 'new_label'}, columns={'old_col': 'new_col'})
```

**Example:**

```python
# Rename row label 'A' to 'Alpha' and column 'Name' to 'First Name'
df_renamed = df.rename(index={'A': 'Alpha'}, columns={'Name': 'First Name'})
print(df_renamed)
```

**Output:**
```
      First Name  Age           City
Alpha      Alice   25       New York
B            Bob   30  San Francisco
C        Charlie   35    Los Angeles
```

---

#### b. **Setting Index Labels with `.set_index()`**

**Definition:**  
`.set_index()` is used to set one of the columns as the DataFrame's index.

**Syntax:**  
```python
DataFrame.set_index(column_name)
```

**Example:**

```python
# Set 'Name' as the index
df_indexed = df.set_index('Name')
print(df_indexed)
```

**Output:**
```
          Age           City
Name                         
Alice      25       New York
Bob        30  San Francisco
Charlie    35    Los Angeles
```

---

#### c. **Resetting Index with `.reset_index()`**

**Definition:**  
The `.reset_index()` method resets the index to the default integer-based index, optionally keeping the old index as a column.

**Syntax:**  
```python
DataFrame.reset_index(drop=False)
```

**Example:**

```python
# Reset index to default and keep the previous index as a column
df_reset = df_indexed.reset_index()
print(df_reset)
```

**Output:**
```
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
```

---

### 3. **MultiIndex (Hierarchical Indexing)**

#### a. **Creating a MultiIndex**

**Definition:**  
A `MultiIndex` allows you to have multiple levels of index labels, which is useful for working with complex datasets.

**Syntax:**  
```python
pd.MultiIndex.from_tuples(list_of_tuples)
```

**Example:**

```python
# MultiIndex with two levels (Location, ID)
index = pd.MultiIndex.from_tuples([('New York', 'A'), ('San Francisco', 'B'), ('Los Angeles', 'C')], names=['City', 'ID'])
df_multi = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}, index=index)
print(df_multi)
```

**Output:**
```
                       Name  Age
City           ID                 
New York       A      Alice   25
San Francisco  B        Bob   30
Los Angeles    C    Charlie   35
```

---

#### b. **Accessing Data in MultiIndex**

**Definition:**  
You can access data at different levels of the MultiIndex using `.loc[]`.

**Example:**

```python
# Access data for a specific location
print(df_multi.loc['New York'])
```

**Output:**
```
   Name  Age
ID           
A  Alice   25
```

```python
# Access data for a specific (City, ID) combination
print(df_multi.loc[('New York', 'A')])
```

**Output:**
```
Name    Alice
Age        25
Name: (New York, A), dtype: object
```
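
To select on an inner level without naming the outer one, `.xs()` (cross-section) is one option, and `.reset_index()` turns the index levels back into ordinary columns. A self-contained sketch using the same data under the illustrative name `cities`:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('New York', 'A'), ('San Francisco', 'B'), ('Los Angeles', 'C')],
    names=['City', 'ID'])
cities = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}, index=index)

# Select rows by the inner level 'ID' without specifying the city
by_id = cities.xs('B', level='ID')
print(by_id)

# Move both index levels back into ordinary columns
flat = cities.reset_index()
print(flat)
```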

---

### 4. **Handling Missing Labels**

#### a. **Reindexing a DataFrame**

**Definition:**  
`.reindex()` is used to conform a DataFrame to a new index, adding missing rows or columns as `NaN`.

**Syntax:**  
```python
DataFrame.reindex(new_labels)
```

**Example:**

```python
# Reindexing rows, adding new labels with NaN values
new_index = ['A', 'B', 'C', 'D']
df_reindexed = df.reindex(new_index)
print(df_reindexed)
```

**Output:**
```
      Name   Age           City
A    Alice  25.0       New York
B      Bob  30.0  San Francisco
C  Charlie  35.0    Los Angeles
D      NaN   NaN            NaN
```

---

#### b. **Filling Missing Values Using `.fillna()`**

**Definition:**  
`.fillna()` is used to fill missing values (`NaN`) with specific values.

**Syntax:**  
```python
DataFrame.fillna(value)
```

**Example:**

```python
# Fill NaN values with 0
df_filled = df_reindexed.fillna(0)
print(df_filled)
```

**Output:**
```
      Name   Age           City
A    Alice  25.0       New York
B      Bob  30.0  San Francisco
C  Charlie  35.0    Los Angeles
D        0   0.0              0
```

---

### 5. **Indexing with Boolean Masks**

**Definition:**  
You can filter data by creating a Boolean mask based on the labels or data in the DataFrame.

**Example:**

```python
# Filter rows where the 'Age' column is greater than 30
mask = df['Age'] > 30
print(df[mask])
```

**Output:**
```
      Name  Age         City
C  Charlie   35  Los Angeles
```

---

### 6. **Using `.at[]` and `.iat[]` for Fast Scalar Access**

#### a. **`.at[]`: Access by Label**

**Definition:**  
`.at[]` is used to access a single element using a label.

**Example:**

```python
# Access the element at row 'A' and column 'Name'
print(df.at['A', 'Name'])  # Output: 'Alice'
```

---

#### b. **`.iat[]`: Access by Integer Location**

**Definition:**  
`.iat[]` is used to access a single element using an integer location.

**Example:**

```python
# Access the element at row 0 and column 0
print(df.iat[0, 0])  # Output: 'Alice'
```

---

### 7. **Indexing with Conditions Based on Labels**

**Definition:**  
You can select specific columns by passing a list of column labels.

**Example:**

```python
# Select columns based on a list of column names
print(df[['Name', 'Age']])  # Select 'Name' and 'Age' columns
```

**Output:**
```
      Name  Age
A    Alice   25
B      Bob   30
C  Charlie   35
```

---

### 8. **Using `.query()` for Label-Based Querying**

**Definition:**  
`.query()` allows querying the DataFrame using column labels with a more readable syntax.

**Example:**

```python
# Query rows where Age > 30
print(df.query('Age > 30'))
```

**Output:**
```
      Name  Age         City
C  Charlie   35  Los Angeles
```
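
`.query()` can also combine conditions and refer to Python variables with the `@` prefix. A minimal sketch; the variable `min_age` and the DataFrame name `people` are illustrative assumptions:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}, index=['A', 'B', 'C'])

min_age = 28  # hypothetical threshold defined outside the query string

# '@min_age' refers to the Python variable; 'and' combines the two conditions
result = people.query('Age >= @min_age and City != "Los Angeles"')
print(result)
```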

---

### 9. **Label Alignment**

**Definition:**  
Operations between Series and DataFrames automatically align on index labels, so values are combined by label rather than by position.

**Example:**

```python
# Automatic alignment by labels
s = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
print(df['Age'] + s)  # Adds corresponding values by index labels
```

**Output:**
```
A    26
B    32
C    38
dtype: int64
```

---

### **Summary of Pandas Label Methods:**

- **Accessing**: `.loc[]`, `.iloc[]`, `.at[]`, `.iat[]`
- **Modifying**: `.rename()`, `.set_index()`, `.reset_index()`
- **MultiIndex**: Creating and accessing data with multiple labels.
- **Handling Missing Labels**: `.reindex()`, `.fillna()`
- **Filtering**: Boolean masks, `.query()`
- **Fast Access**: `.at[]`, `.iat[]` for scalar access.

---
---
---

### **Pandas DataFrame:**

* **Pandas DataFrame** is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).
* It is one of the most commonly used structures for data analysis in Python.



---

### 1. **Creating a Pandas DataFrame**

#### a. **From a Dictionary of Lists**
You can create a DataFrame from a dictionary where keys are column names and values are lists.
```python
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)
print(df)
```
**Output:**
```
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
```

#### b. **From a List of Dictionaries**
Each dictionary in the list represents a row.
```python
data = [{'A': 1, 'B': 2}, {'A': 3, 'B': 4, 'C': 5}]
df = pd.DataFrame(data)
print(df)
```
**Output:**
```
   A  B    C
0  1  2  NaN
1  3  4  5.0
```

#### c. **From a 2D NumPy Array**
You can create a DataFrame from a NumPy array, optionally with custom row and column labels.
```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
```
**Output:**
```
   A  B  C
0  1  2  3
1  4  5  6
```

---

### 2. **Accessing Data in DataFrame**

#### a. **Accessing Columns**
You can access a single column by its label.
```python
print(df['A'])
```
**Output:**
```
0    1
1    4
Name: A, dtype: int32
```

#### b. **Accessing Rows by Index (`.iloc[]`)**
Access rows by their index position.
```python
print(df.iloc[1])
```
**Output:**
```
A    4
B    5
C    6
Name: 1, dtype: int32
```

#### c. **Accessing Rows by Label (`.loc[]`)**
Access rows by their index labels (row names).
```python
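# Use a separate name so that `df` keeps its default 0, 1 index for the later examples
df_labeled = pd.DataFrame(data, columns=['A', 'B', 'C'], index=['row1', 'row2'])
print(df_labeled.loc['row1'])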
```
**Output:**
```
A    1
B    2
C    3
Name: row1, dtype: int32
```

---

### 3. **Modifying a DataFrame**

#### a. **Adding a New Column**
You can add a new column by assigning it a list or a scalar value.
```python
df['D'] = [10, 11]
print(df)
```
**Output:**
```
   A  B  C   D
0  1  2  3  10
1  4  5  6  11
```

#### b. **Dropping Columns or Rows**
Use `.drop()` to remove columns or rows.
```python
df = df.drop('C', axis=1)  # Drop column 'C'
print(df)
```
**Output:**
```
   A  B   D
0  1  2  10
1  4  5  11
```
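
Rows are dropped the same way, by index label. The sketch below uses a small illustrative DataFrame named `table` so that it runs on its own; `.drop()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

table = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'D': [10, 11]})

without_first_row = table.drop(0)      # axis=0 (rows) is the default
without_b = table.drop(columns=['B'])  # keyword form for dropping columns

print(without_first_row)
print(without_b)
```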

---

### 4. **Filtering and Boolean Indexing**

#### a. **Filtering Rows Based on Conditions**
Filter rows based on a condition.
```python
print(df[df['A'] > 1])
```
**Output:**
```
   A  B   D
1  4  5  11
```

#### b. **Using `.isin()` to Filter**
Use `.isin()` to keep the rows whose column value appears in a given list.
```python
print(df[df['B'].isin([2, 5])])
```
**Output:**
```
   A  B   D
0  1  2  10
1  4  5  11
```


---

### 5. **DataFrame Operations**

#### a. **Mathematical Operations**
You can apply mathematical operations directly to DataFrame columns; they are performed element-wise.
```python
df['A_plus_B'] = df['A'] + df['B']
print(df)
```
**Output:**
```
   A  B   D  A_plus_B
0  1  2  10         3
1  4  5  11         9
```

#### b. **Applying Functions Row/Column-wise (`.apply()`)**
Apply a function to each element of a column, or (with `DataFrame.apply()`) to each row or column.
```python
df['double_A'] = df['A'].apply(lambda x: x * 2)
print(df)
```
**Output:**
```
   A  B   D  A_plus_B  double_A
0  1  2  10         3         2
1  4  5  11         9         8
```
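
The example above applies the function to a single column. On a whole DataFrame, `apply()` passes each column to the function by default and each row when `axis=1`. A minimal sketch with an illustrative DataFrame named `scores`:

```python
import pandas as pd

scores = pd.DataFrame({'A': [1, 4], 'B': [2, 5]})

col_max = scores.apply(lambda col: col.max())                    # default axis=0: one result per column
row_sum = scores.apply(lambda row: row['A'] + row['B'], axis=1)  # axis=1: one result per row

print(col_max)
print(row_sum)
```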

---

### 6. **Grouping and Aggregating Data**

#### a. **Grouping Data (`.groupby()`)**
Group rows based on column values.
```python
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A').sum()
print(grouped)
```
**Output:**
```
       B
A       
bar    6
foo    4
```
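
Several statistics per group can be computed in one call with named aggregation. This is a sketch on the same data; the output column names `total`, `average`, and `rows` are illustrative choices.

```python
import pandas as pd

sales = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'B': [1, 2, 3, 4]})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = sales.groupby('A').agg(
    total=('B', 'sum'),
    average=('B', 'mean'),
    rows=('B', 'count'),
)
print(summary)
```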

#### b. **Aggregating Data (`.agg()`)**
Perform multiple aggregations on DataFrame columns.
```python
df = pd.DataFrame({'A': [1, 2, 3, 4]})
print(df.agg(['sum', 'mean']))
```
**Output:**
```
         A
sum   10.0
mean   2.5
```

---

### 7. **Merging, Joining, and Concatenating DataFrames**

#### a. **Concatenating DataFrames**
You can concatenate DataFrames vertically or horizontally using `pd.concat()`.
```python
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df_concat = pd.concat([df1, df2], axis=1)
print(df_concat)
```
**Output:**
```
   A  B
0  1  3
1  2  4
```

#### b. **Merging DataFrames (`pd.merge()`)**
Merge DataFrames based on common columns or indices.
```python
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value2': [3, 4]})
df_merged = pd.merge(df1, df2, on='key')
print(df_merged)
```
**Output:**
```
  key  value  value2
0   A      1       3
1   B      2       4
```
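
When the keys do not match exactly, the `how` parameter controls which rows are kept, and `indicator=True` adds a `_merge` column showing where each row came from. A sketch with illustrative DataFrames `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [20, 30, 40]})

inner = pd.merge(left, right, on='key')  # default how='inner': only keys present in both
outer = pd.merge(left, right, on='key', how='outer', indicator=True)  # all keys from both sides

print(inner)
print(outer)
```

Keys found on only one side get `NaN` in the columns that come from the other DataFrame.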

---

### 8. **Handling Missing Data**

#### a. **Filling Missing Values (`.fillna()`)**
Fill `NaN` values with a specific value.
```python
df = pd.DataFrame({'A': [1, None, 3]})
df_filled = df.fillna(0)
print(df_filled)
```
**Output:**
```
     A
0  1.0
1  0.0
2  3.0
```

#### b. **Dropping Missing Values (`.dropna()`)**
Remove rows containing `NaN` values.
```python
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
df_dropped = df.dropna()
print(df_dropped)
```

**Output:**
```
     A    B
0  1.0  4.0
```
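
`.dropna()` takes a few parameters that narrow what is removed. A minimal sketch on the same data, under the illustrative name `df_na`:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

only_a = df_na.dropna(subset=['A'])    # consider only column 'A' when looking for NaN
all_missing = df_na.dropna(how='all')  # drop a row only if every value in it is NaN
no_nan_cols = df_na.dropna(axis=1)     # drop columns (not rows) that contain NaN

print(only_a)
print(all_missing)
print(no_nan_cols)
```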

---

### 9. **Sorting Data**

#### a. **Sorting by Column Values (`.sort_values()`)**
Sort the DataFrame based on column values.
```python
df = pd.DataFrame({'A': [3, 1, 2]})
df_sorted = df.sort_values(by='A')
print(df_sorted)
```
**Output:**
```
   A
1  1
2  2
0  3
```
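
Sorting can use several columns with a different direction for each. A sketch on illustrative data (the DataFrame `team` is not part of the original notebook):

```python
import pandas as pd

team = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [30, 25, 30, 25],
})

# Oldest first; ties broken alphabetically by name
ordered = team.sort_values(by=['Age', 'Name'], ascending=[False, True])
ordered = ordered.reset_index(drop=True)  # renumber the rows after sorting
print(ordered)
```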

#### b. **Sorting by Index (`.sort_index()`)**
Sort the DataFrame based on the index.
```python
df = pd.DataFrame({'A': [1, 2, 3]}, index=[2, 0, 1])
df_sorted = df.sort_index()
print(df_sorted)
```
**Output:**
```
   A
0  2
1  3
2  1
```

---

### 10. **Statistical Functions**

#### a. **Descriptive Statistics (`.describe()`)**
Generate descriptive statistics for DataFrame columns.
```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.describe())
```
**Output:**
```
         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0
```

#### b. **Correlation (`.corr()`)**
Computes the correlation between DataFrame columns.
```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.corr())
```
**Output:**
```
     A    B
A  1.0  1.0
B  1.0  1.0
```

---

### 11. **Pivoting and Reshaping**

#### a. **Pivoting Data (`.pivot_table()`)**
Create a pivot table based on the DataFrame.
```python
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': [1, 2, 3]})
pivot_df = df.pivot_table(values='B', index='A', aggfunc='mean')
print(pivot_df)
```
**Output:**
```
       B
A       
bar  2.0
foo  2.0
```

#### b. **Reshaping with `.melt()`**
Unpivots a DataFrame from wide format to long format.
```python
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
melted_df = pd.melt(df, id_vars='A', value_vars=['B'])
print(melted_df)
```
**Output:**
```
   A variable  value
0  1        B      3
1  2        B      4
```

---

### 12. **Indexing and Selecting Data**

#### a. **Resetting the Index (`.reset_index()`)**
Resets the index of the DataFrame to default.
```python
df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
df_reset = df.reset_index()
print(df_reset)
```
**Output:**
```
  index  A
0     x  1
1     y  2
2     z  3
```

#### b. **Reindexing the DataFrame (`.reindex()`)**
Reindex the DataFrame to align with new row/column labels.
```python
df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
df_reindexed = df.reindex(['c', 'b', 'a', 'd'])
print(df_reindexed)
```
**Output:**
```
     A
c  3.0
b  2.0
a  1.0
d  NaN
```

---

### 13. **Combining DataFrames**

#### a. **Appending Rows (`pd.concat()`)**
`DataFrame.append()` was removed in pandas 2.0. Use `pd.concat()` to append the rows of one DataFrame to another.
```python
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
df_appended = pd.concat([df1, df2])
print(df_appended)
```
**Output:**
```
   A
0  1
1  2
0  3
1  4
```

---

### 14. **Time Series Handling**

#### a. **Handling Dates and Times (`pd.to_datetime()`)**
Converts strings to datetime objects.
```python
df = pd.DataFrame({'A': ['2023-01-01', '2023-02-01']})
df['A'] = pd.to_datetime(df['A'])
print(df)
```
**Output:**
```
           A
0 2023-01-01
1 2023-02-01
```

#### b. **Resampling Time Series Data (`.resample()`)**
Resample time series data for different frequencies.
```python
df = pd.DataFrame({'A': pd.date_range('2023-01-01', periods=4, freq='D'), 'B': [1, 2, 3, 4]})
df_resampled = df.resample('2D', on='A').sum()
print(df_resampled)
```
**Output:**
```
            B
A             
2023-01-01  3
2023-01-03  7
```
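
When the dates are already the index, `on=` is not needed. A minimal sketch with a Series on a `DatetimeIndex`, using the mean instead of the sum (the names `ts` and `two_day_mean` are illustrative):

```python
import pandas as pd

ts = pd.Series([1, 2, 3, 4], index=pd.date_range('2023-01-01', periods=4, freq='D'))

# Group the daily values into 2-day bins and average each bin
two_day_mean = ts.resample('2D').mean()
print(two_day_mean)
```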

---
---
---

### **Indexing and Selecting Data**
* In Pandas, indexing and selecting data refers to the process of accessing specific elements, rows, or columns within a Series or DataFrame
* It allows you to isolate and manipulate particular parts of your dataset


### 1. **Basic Indexing**
Indexing refers to selecting rows and columns from a DataFrame or Series. In Pandas, there are several ways to perform indexing:

- **Single column selection:** You can select a single column by using the column name in square brackets `[]`.
- **Multiple columns selection:** You can select multiple columns by passing a list of column names.
- **Row selection:** You can select rows by label or position using `.loc[]` and `.iloc[]`.

---

#### **Example: Single Column Selection**

```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Select single column (by column name)
age_column = df['Age']
print(age_column)
```

**Output:**
```
0    25
1    30
2    35
Name: Age, dtype: int64
```

---

#### **Example: Multiple Columns Selection**

```python
# Select multiple columns
selected_columns = df[['Name', 'City']]
print(selected_columns)
```

**Output:**
```
      Name           City
0    Alice       New York
1      Bob  San Francisco
2  Charlie    Los Angeles
```

---

### 2. **Label-Based Indexing with `.loc[]`**
The `.loc[]` function is used for label-based indexing, meaning you select rows and columns based on their labels.

#### **Syntax:**
```python
DataFrame.loc[row_labels, column_labels]
```

- `row_labels`: Can be a single label, a list of labels, or a slice of labels. Label slices include both endpoints.
- `column_labels`: Can be a single column name, a list of column names, or a slice of column names.

---

#### **Example: Label-Based Indexing with .loc[]**

```python
# Select rows and columns by label using .loc[]
selected_data = df.loc[0, 'Name']  # First row, "Name" column
print(selected_data)
```

**Output:**
```
Alice
```

---

#### **Example: Multiple Rows and Columns with .loc[]**

```python
# Select multiple rows and columns
selected_data = df.loc[0:1, ['Name', 'City']]
print(selected_data)
```

**Output:**
```
    Name           City
0  Alice       New York
1    Bob  San Francisco
```

---

### 3. **Position-Based Indexing with `.iloc[]`**
The `.iloc[]` function is used for integer-based indexing, where you select rows and columns by their positions (0-based index).

#### **Syntax:**
```python
DataFrame.iloc[row_index, column_index]
```

- `row_index`: An integer position, a list of positions, or a slice for rows.
- `column_index`: An integer position, a list of positions, or a slice for columns.

---

#### **Example: Position-Based Indexing with .iloc[]**

```python
# Select data using .iloc[] (by integer position)
selected_data = df.iloc[0, 1]  # First row, second column ("Age")
print(selected_data)
```

**Output:**
```
25
```

---

#### **Example: Multiple Rows and Columns with .iloc[]**

```python
# Select multiple rows and columns using .iloc[]
selected_data = df.iloc[0:2, 0:2]  # First two rows, first two columns
print(selected_data)
```

**Output:**
```
      Name  Age
0    Alice   25
1      Bob   30
```
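
One difference between the two indexers is easy to miss: a `.loc[]` slice includes its end label, while an `.iloc[]` slice excludes its end position, as in ordinary Python slicing. A small sketch with an illustrative DataFrame named `letters`:

```python
import pandas as pd

letters = pd.DataFrame({'value': [10, 20, 30]}, index=['a', 'b', 'c'])

by_label = letters.loc['a':'b']  # label slice: includes 'b', so two rows
by_pos = letters.iloc[0:1]       # position slice: excludes position 1, so one row

print(by_label)
print(by_pos)
```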

---

### 4. **Boolean Indexing**
You can filter data using Boolean conditions. This is called Boolean Indexing, where the condition returns a Boolean mask (True/False) for filtering.

#### **Example: Boolean Indexing**

```python
# Boolean indexing to filter rows where Age is greater than 30
filtered_data = df[df['Age'] > 30]
print(filtered_data)
```

**Output:**
```
      Name  Age         City
2  Charlie   35  Los Angeles
```
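
Conditions can be combined with `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on the same sample data under the illustrative name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

mask = (people['Age'] >= 30) & (people['City'] != 'Los Angeles')
matching = people[mask]   # rows where both conditions hold
others = people[~mask]    # every other row

print(matching)
print(others)
```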

---

### 5. **Using `.at[]` and `.iat[]` for Fast Scalar Access**
- **`.at[]`**: Fast label-based access for a single element.
- **`.iat[]`**: Fast position-based access for a single element.

#### **Example: Using `.at[]` and `.iat[]`**

```python
# Access single value using .at[] (label-based)
name_value = df.at[0, 'Name']
print(name_value)

# Access single value using .iat[] (position-based)
age_value = df.iat[0, 1]
print(age_value)
```

**Output:**
```
Alice
25
```

---

### 6. **Slicing Data**
You can slice rows and columns in Pandas similarly to how you slice lists or arrays in Python.

#### **Example: Slicing Rows**

```python
# Slice rows at positions 1 and 2 (the stop position 3 is excluded)
row_slice = df[1:3]
print(row_slice)
```

**Output:**
```
      Name  Age           City
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
```

---

### 7. **Setting Data Using Labels or Indexes**
You can set values in a DataFrame by using `.loc[]` or `.iloc[]`.

#### **Example: Setting Values**

```python
# Set value using label-based indexing
df.loc[0, 'Age'] = 26
print(df)

# Set value using integer-based indexing
df.iloc[1, 2] = 'SF'
print(df)
```

**Output:**
```
      Name  Age           City
0    Alice   26       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
```

After setting value with `.iloc[]`:
```
      Name  Age         City
0    Alice   26     New York
1      Bob   30           SF
2  Charlie   35  Los Angeles
```

---

### 8. **Indexing with Conditions (Boolean Masking)**
You can use conditions to filter rows and select data that meets the condition.

#### **Example: Conditional Selection**

```python
# Select rows where Age > 30
filtered_data = df[df['Age'] > 30]
print(filtered_data)
```

**Output:**
```
      Name  Age         City
2  Charlie   35  Los Angeles
```

---

### 9. **Using `.query()` for Label-Based Querying**
The `.query()` method allows querying a DataFrame using expressions based on column labels.

#### **Example: Using `.query()`**

```python
# Query rows where Age > 30
query_result = df.query('Age > 30')
print(query_result)
```

**Output:**
```
      Name  Age         City
2  Charlie   35  Los Angeles
```
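
Query strings can also refer to Python variables with the `@` prefix and combine conditions with `and` / `or`. A minimal sketch, assuming illustrative data and a hypothetical threshold `min_age`:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [26, 30, 35],
    'City': ['New York', 'SF', 'Los Angeles'],
})

min_age = 28  # hypothetical threshold defined outside the DataFrame

# '@' refers to a Python variable; 'and' combines conditions
result = people.query('Age > @min_age and City != "SF"')
print(result)
```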

---

### 10. **MultiIndex (Hierarchical Indexing)**
Pandas allows for a MultiIndex, which is a way to have multiple levels of indexing, useful for more complex datasets.

#### **Example: Creating a MultiIndex**

```python
# Create a DataFrame with a MultiIndex
index = pd.MultiIndex.from_tuples([('NY', 'A'), ('SF', 'B'), ('LA', 'C')], names=['City', 'ID'])
df_multi = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}, index=index)
print(df_multi)
```

**Output:**
```
             Name  Age
City ID              
NY   A      Alice   25
SF   B        Bob   30
LA   C    Charlie   35
```

#### **Accessing Data with MultiIndex**

```python
# Access data by MultiIndex
print(df_multi.loc['NY'])  # All rows for 'NY' (a DataFrame indexed by the remaining level, ID)
print(df_multi.loc[('SF', 'B')])  # Specific (City, ID) combination (a Series)
```

**Output:**
```
     Name  Age
ID            
A   Alice   25

Name    Bob
Age      30
Name: (SF, B), dtype: object
```
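
Selecting by an inner level is a common follow-up that `.loc[]` with a single key does not cover. One option is `.xs()` (cross-section); the sketch below rebuilds the same `df_multi` so it runs on its own.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('NY', 'A'), ('SF', 'B'), ('LA', 'C')], names=['City', 'ID']
)
df_multi = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}, index=index
)

# Select by the inner level without naming the outer one
by_id = df_multi.xs('B', level='ID')
print(by_id)

# Move the index levels back into ordinary columns
flat = df_multi.reset_index()
print(flat.columns.tolist())
```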

---

### 11. **Handling Missing Data (Null Values)**
Handling missing data is crucial. You can use `.isnull()`, `.notnull()`, `.fillna()`, and `.dropna()` to handle missing values.

#### **Example: Handling Missing Data**

```python
# Reindex with a label (3) that does not exist; the new row is filled with NaN
df_with_nan = df.reindex([0, 1, 2, 3])
print(df_with_nan)

# Fill NaN values
df_filled = df_with_nan.fillna(0)
print(df_filled)
```

**Output:**
```
      Name   Age         City
0    Alice  26.0     New York
1      Bob  30.0           SF
2  Charlie  35.0  Los Angeles
3      NaN   NaN          NaN

      Name   Age         City
0    Alice  26.0     New York
1      Bob  30.0           SF
2  Charlie  35.0  Los Angeles
3        0   0.0            0
```
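
The example above uses only `.fillna()`. The other methods named in this section can be sketched as follows, assuming a small illustrative DataFrame (`people` and `with_gap` are hypothetical names):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [26, 30, 35]})
with_gap = people.reindex([0, 1, 2, 3])   # label 3 does not exist, so it becomes NaN

missing_flags = with_gap['Age'].isnull().tolist()   # True marks a missing value
present_count = with_gap['Age'].notnull().sum()     # count of non-missing values
cleaned = with_gap.dropna()                         # drop rows containing any missing value

print(missing_flags)
print(present_count)
print(len(cleaned))
```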

---

### Summary
This section covered:
- Basic indexing (`[]`)
- Label-based indexing (`.loc[]`)
- Position-based indexing (`.iloc[]`)
- Boolean indexing
- Scalar access (`.at[]`, `.iat[]`)
- Slicing
- Setting values
- Conditional selection
- Querying with `.query()`
- MultiIndexing
- Handling missing data

---
---
---

### **Data Types in Pandas**
* Pandas supports a variety of data types, which are essential for effective data analysis and manipulation.

### 1. **Understanding Data Types (`dtypes`)**
In Pandas, each column in a DataFrame is assigned a specific data type. Pandas supports various data types like integers, floats, strings (objects), datetime, and more. The `.dtypes` attribute shows the data types of all columns.

#### **Example: Checking Data Types**

```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'Salary': [70000.0, 80000.0, 120000.0]}
df = pd.DataFrame(data)

# Check data types
print(df.dtypes)
```

**Output:**
```
Name       object
Age         int64
Salary    float64
dtype: object
```

---

### 2. **Converting Data Types (`astype()`)**
You can explicitly convert data from one type to another using `.astype()`. This is useful when handling data that comes in an incorrect type (e.g., numbers as strings).

#### **Example: Convert Data Types**

```python
# Convert Age to float
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
```

**Output:**
```
Name       object
Age       float64
Salary    float64
dtype: object
```

#### **Example: Convert String to Integer**

```python
# Create DataFrame with string numbers
data = {'ID': ['1', '2', '3'], 'Value': ['10', '20', '30']}
df_str = pd.DataFrame(data)

# Convert ID and Value columns to integer type
df_str['ID'] = df_str['ID'].astype(int)
df_str['Value'] = df_str['Value'].astype(int)
print(df_str.dtypes)
```

**Output:**
```
ID       int64
Value    int64
dtype: object
```

---

### 3. **Automatic Type Inference**
Pandas automatically infers the data type when a DataFrame or Series is created, but sometimes you may need to check the types and ensure they're correct.

#### **Example: Automatic Type Detection**

```python
# Creating a DataFrame with mixed data types
data = {'Name': ['Alice', 'Bob'], 'Age': [25, '30'], 'Salary': [60000, '75000']}
df_mixed = pd.DataFrame(data)

# Check the inferred types
print(df_mixed.dtypes)
```

**Output:**
```
Name      object
Age       object
Salary    object
dtype: object
```
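
When a column ends up as `object` because numbers and numeric strings are mixed, `pd.to_numeric()` is one way to repair it. A minimal sketch that rebuilds `df_mixed`; the Series `raw` is a hypothetical example of messy input:

```python
import pandas as pd

df_mixed = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, '30'], 'Salary': [60000, '75000']})

# pd.to_numeric parses numeric strings and leaves real numbers untouched
df_mixed['Age'] = pd.to_numeric(df_mixed['Age'])
df_mixed['Salary'] = pd.to_numeric(df_mixed['Salary'])
print(df_mixed.dtypes)

# errors='coerce' turns unparseable entries into NaN instead of raising
raw = pd.Series(['10', 'n/a', '30'])
parsed = pd.to_numeric(raw, errors='coerce')
print(parsed.tolist())
```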

---

### 4. **Categorical Data Type**
Categorical data is a type that stores limited, predefined values. This reduces memory usage and speeds up operations for certain columns, especially for large datasets with repeated values.

#### **Example: Using Categorical Data**

```python
# Create a DataFrame with categorical data
df['Department'] = pd.Categorical(['HR', 'Finance', 'IT'])
print(df['Department'].dtype)
```

**Output:**
```
category
```

#### **Example: Convert Column to Categorical**

```python
# Convert existing column to categorical type
df['Name'] = df['Name'].astype('category')
print(df.dtypes)
```

**Output:**
```
Name          category
Age            float64
Salary         float64
Department    category
dtype: object
```
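
Categories can also carry an order, which makes comparisons and sorting follow the declared ranking rather than alphabetical order. A short sketch with hypothetical size labels:

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])  # hypothetical data

# An ordered categorical knows that small < medium < large
size_type = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
sizes_cat = sizes.astype(size_type)

codes = sizes_cat.cat.codes.tolist()            # integer codes stored internally
above_small = (sizes_cat > 'small').tolist()    # comparisons follow the declared order
in_order = sizes_cat.sort_values().tolist()     # sorting follows it too

print(codes)
print(above_small)
print(in_order)
```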

---

### 5. **Datetime Data Type (`datetime64`)**
The `datetime64` data type allows you to work with dates and times efficiently. You can convert a column to a datetime type using `pd.to_datetime()`.

#### **Example: Converting to Datetime**

```python
# Create a DataFrame with date strings
data = {'Event': ['Event1', 'Event2'], 'Date': ['2023-01-01', '2024-01-01']}
df_dates = pd.DataFrame(data)

# Convert 'Date' column to datetime64
df_dates['Date'] = pd.to_datetime(df_dates['Date'])
print(df_dates.dtypes)
```

**Output:**
```
Event            object
Date     datetime64[ns]
dtype: object
```

#### **Example: Datetime Operations**

```python
# Extract the year and month from a datetime column
df_dates['Year'] = df_dates['Date'].dt.year
df_dates['Month'] = df_dates['Date'].dt.month
print(df_dates)
```

**Output:**
```
    Event       Date  Year  Month
0  Event1 2023-01-01  2023      1
1  Event2 2024-01-01  2024      1
```
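
Datetime columns also support arithmetic. The sketch below rebuilds `df_dates` and adds two illustrative columns, `Weekday` and `Week_Later` (names chosen for this example only):

```python
import pandas as pd

df_dates = pd.DataFrame({'Event': ['Event1', 'Event2'], 'Date': ['2023-01-01', '2024-01-01']})
df_dates['Date'] = pd.to_datetime(df_dates['Date'])

# Subtracting two datetimes yields a Timedelta
gap = df_dates['Date'].iloc[1] - df_dates['Date'].iloc[0]
print(gap.days)

# Day names come from the .dt accessor; offsets are added with pd.Timedelta
df_dates['Weekday'] = df_dates['Date'].dt.day_name()
df_dates['Week_Later'] = df_dates['Date'] + pd.Timedelta(days=7)
print(df_dates)
```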

---

### 6. **Handling Missing Data Types (`NaN`)**
Missing data in Pandas is usually represented by `NaN` (Not a Number). `NaN` is itself a floating-point value, so an integer column that contains `NaN` is upcast to `float64`.

#### **Example: Handling Missing Data**

```python
import numpy as np

# Create DataFrame with NaN values
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, np.nan, 35]}
df_nan = pd.DataFrame(data)

# Check data types
print(df_nan.dtypes)
```

**Output:**
```
Name     object
Age     float64
dtype: object
```

#### **Example: Fill Missing Values**

```python
# Fill NaN values with a default value
df_filled = df_nan.fillna(30)
print(df_filled)
```

**Output:**
```
      Name   Age
0    Alice  25.0
1      Bob  30.0
2  Charlie  35.0
```

---

### 7. **Object Data Type**
The `object` data type in Pandas is used for storing strings or mixed data types. If a column contains both numeric and non-numeric values, it will default to the object data type.

#### **Example: Object Type for Strings**

```python
# Create a DataFrame with strings
data = {'Name': ['Alice', 'Bob'], 'City': ['New York', 'San Francisco']}
df_strings = pd.DataFrame(data)

# Check data types
print(df_strings.dtypes)
```

**Output:**
```
Name    object
City    object
dtype: object
```

---

### 8. **Float Data Type (`float64`)**
The float data type is used for decimal numbers. If a column contains numbers with decimals, Pandas will infer it as `float64`.

#### **Example: Float Data Type**

```python
# Create a DataFrame with float numbers
data = {'Name': ['Alice', 'Bob'], 'Salary': [75000.0, 85000.0]}
df_float = pd.DataFrame(data)

# Check data types
print(df_float.dtypes)
```

**Output:**
```
Name      object
Salary    float64
dtype: object
```

---

### 9. **Integer Data Type (`int64`)**
The integer data type is used for whole numbers. Pandas assigns columns with whole numbers to the `int64` type.

#### **Example: Integer Data Type**

```python
# Create a DataFrame with integer numbers
data = {'ID': [101, 102, 103], 'Age': [25, 30, 35]}
df_int = pd.DataFrame(data)

# Check data types
print(df_int.dtypes)
```

**Output:**
```
ID     int64
Age    int64
dtype: object
```

---

### 10. **Nullable Integer Data Type (`Int64`)**
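Pandas supports nullable integer types (such as `Int64`, with a capital "I"), which let a column hold integers alongside missing values. Missing entries are displayed as `<NA>` (`pd.NA`), and the column is not upcast to `float64`.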

#### **Example: Nullable Integer**

```python
# Create a DataFrame with NaN in integer column
data = {'ID': [101, np.nan, 103]}
df_nullable_int = pd.DataFrame(data, dtype='Int64')

# Check data types
print(df_nullable_int.dtypes)
```

**Output:**
```
ID    Int64
dtype: object
```
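
An existing float column that only holds whole numbers and `NaN` can be converted the same way. A minimal sketch with illustrative values:

```python
import pandas as pd
import numpy as np

ids_float = pd.Series([101, np.nan, 103])      # inferred as float64 because of NaN
ids_nullable = ids_float.astype('Int64')       # whole numbers are kept, NaN becomes <NA>

print(ids_float.dtype, ids_nullable.dtype)
print(ids_nullable.isna().tolist())
print(ids_nullable.sum())                      # missing values are skipped by default
```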

---

### 11. **Checking Memory Usage (`memory_usage()`)**
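You can check the memory usage of each column in a DataFrame using the `.memory_usage()` method. This is useful for optimizing data types. With `deep=True`, the contents of `object` columns (such as strings) are measured as well. The exact byte counts depend on the pandas version, the platform, and the DataFrame's current columns, so treat the numbers below as illustrative.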

#### **Example: Checking Memory Usage**

```python
# Check memory usage of DataFrame
memory_usage = df.memory_usage(deep=True)
print(memory_usage)
```

**Output:**
```
Index      128
Name        90
Age         24
Salary      24
dtype: int64
```
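
To see why this matters for choosing data types, the sketch below compares the same repeated strings stored as plain strings and as a categorical column. The data is hypothetical and the byte counts will vary by environment, so only the comparison is printed as a result.

```python
import pandas as pd

# Hypothetical column with many repeated strings
cities = pd.Series(['New York', 'San Francisco', 'Los Angeles'] * 1000)

as_strings = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(as_strings, as_category)
print(as_category < as_strings)
```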

---

### Summary of Key Methods for Data Types:
- **Checking Data Types:** `.dtypes`
- **Converting Data Types:** `.astype()`
- **Handling Missing Values:** `.fillna()`, `.dropna()`
- **Categorical Data:** `pd.Categorical()`, `astype('category')`
- **Datetime Data:** `pd.to_datetime()`, `.dt` accessor for extracting components
- **Nullable Integers:** `Int64`
- **Memory Optimization:** `.memory_usage()`


---
---
---
### **Reading and Writing Data (CSV, Excel, SQL, etc.)**
---

### 1. **Reading CSV Files (`pd.read_csv()`)**
Pandas can read CSV files using the `pd.read_csv()` function. This is one of the most commonly used functions for importing data into Pandas.

#### **Example: Reading a CSV File**

```python
import pandas as pd

# Reading a CSV file
df = pd.read_csv('data.csv')

# Display the first few rows
print(df.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```
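
`pd.read_csv()` takes many optional parameters. The sketch below shows a few commonly used ones; it reads from an in-memory string instead of a file so it runs on its own, and the `Joined` column is an assumption added for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a file on disk (contents are illustrative)
csv_text = """ID,Name,Age,Salary,Joined
1,Alice,25,70000,2021-03-01
2,Bob,30,80000,2020-07-15
3,Charlie,35,120000,2019-11-30
"""

employees = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['ID', 'Name', 'Salary', 'Joined'],  # load only these columns
    dtype={'Salary': 'float64'},                 # override the inferred type
    parse_dates=['Joined'],                      # parse this column as datetimes
    index_col='ID',                              # use ID as the row index
)
print(employees.dtypes)
```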

---

### 2. **Writing to CSV Files (`df.to_csv()`)**
You can write DataFrame data to a CSV file using the `to_csv()` method. By default, it writes data with an index column.

#### **Example: Writing to a CSV File**

```python
# Writing DataFrame to CSV without index
df.to_csv('output.csv', index=False)
```

**Output:**
A CSV file `output.csv` is created without the index column.
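
The effect of `index=False` is easiest to see by comparing the header lines. When no path is given, `to_csv()` returns the CSV text, which the following sketch uses to avoid writing files (the `employees` data is illustrative):

```python
import io
import pandas as pd

employees = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})

with_index = employees.to_csv()                # no path given: returns the CSV text
without_index = employees.to_csv(index=False)

print(with_index.splitlines()[0])      # header begins with an empty index column
print(without_index.splitlines()[0])

# Reading the index-free text back recovers the same rows and columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```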

---

### 3. **Reading Excel Files (`pd.read_excel()`)**
Pandas can read Excel files using the `pd.read_excel()` function. You can specify the sheet name if the Excel file contains multiple sheets. Reading `.xlsx` files requires an engine such as `openpyxl` to be installed.

#### **Example: Reading an Excel File**

```python
# Reading an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Display the first few rows
print(df_excel.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```

---

### 4. **Writing to Excel Files (`df.to_excel()`)**
You can write data to an Excel file using `to_excel()`. Similar to CSV, you can control whether to include the index and specify the sheet name.

#### **Example: Writing to an Excel File**

```python
# Writing DataFrame to Excel without index
df.to_excel('output.xlsx', index=False, sheet_name='Employees')
```

**Output:**
An Excel file `output.xlsx` is created with the DataFrame data on the sheet "Employees".

---

### 5. **Reading SQL Databases (`pd.read_sql()`)**
Pandas can read data from SQL databases using `pd.read_sql()`. You need a connection object to the database and a valid SQL query.

#### **Example: Reading from an SQL Database**

```python
import sqlite3

# Create a connection to the database
conn = sqlite3.connect('example.db')

# Query the SQL database
df_sql = pd.read_sql('SELECT * FROM employees', conn)

# Display the first few rows
print(df_sql.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```

---

### 6. **Writing to SQL Databases (`df.to_sql()`)**
You can write data to an SQL database using `to_sql()`. You'll need to pass the table name and connection object.

#### **Example: Writing to an SQL Database**

```python
# Writing DataFrame to SQL table
df.to_sql('employees', conn, if_exists='replace', index=False)
```

**Output:**
The data from the DataFrame is written to the `employees` table in the SQL database.
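
The write and read steps can be tried together without a database file by using an in-memory SQLite database. This is a sketch under that assumption; the table contents are illustrative, and the `params` argument is shown as a safer alternative to formatting values into the SQL string.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')   # throwaway in-memory database

employees = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})
employees.to_sql('employees', conn, if_exists='replace', index=False)

# Pass values through params rather than building the SQL string by hand
older = pd.read_sql('SELECT Name, Age FROM employees WHERE Age > ?', conn, params=(28,))
print(older)

conn.close()
```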

---

### 7. **Reading JSON Files (`pd.read_json()`)**
Pandas can read data from JSON files using `pd.read_json()`. This is useful for working with web data or APIs.

#### **Example: Reading a JSON File**

```python
# Reading a JSON file
df_json = pd.read_json('data.json')

# Display the first few rows
print(df_json.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```

---

### 8. **Writing to JSON Files (`df.to_json()`)**
You can write a DataFrame to a JSON file using the `to_json()` method. You can control the format of the JSON output (e.g., `records`, `split`).

#### **Example: Writing to a JSON File**

```python
# Writing DataFrame to JSON
df.to_json('output.json', orient='records')
```

**Output:**
A JSON file `output.json` is created with the DataFrame data in the "records" format.
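
The `orient` values differ in how the same data is laid out. When no path is given, `to_json()` returns the JSON text, which makes the layouts easy to compare; the data below is illustrative.

```python
import io
import json
import pandas as pd

employees = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})

records_json = employees.to_json(orient='records')   # a list of row objects
split_json = employees.to_json(orient='split')       # separate columns, index, and data

print(records_json)
print(sorted(json.loads(split_json).keys()))

# Reading the records layout back into a DataFrame
restored = pd.read_json(io.StringIO(records_json), orient='records')
print(restored)
```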

---

### 9. **Reading HTML Tables (`pd.read_html()`)**
Pandas can extract tables from HTML files or web pages using the `pd.read_html()` function. It always returns a list of DataFrames, one per table found, and requires an HTML parser such as `lxml` to be installed.

#### **Example: Reading an HTML Table**

```python
# Reading tables from an HTML file
df_html_list = pd.read_html('data.html')

# Select the first table
df_html = df_html_list[0]

# Display the first few rows
print(df_html.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```

---

### 10. **Writing to HTML Files (`df.to_html()`)**
You can write a DataFrame to an HTML file using `to_html()`, which converts the DataFrame into an HTML table.

#### **Example: Writing to an HTML File**

```python
# Writing DataFrame to HTML
df.to_html('output.html', index=False)
```

**Output:**
An HTML file `output.html` is created with the DataFrame data as an HTML table.

---

### 11. **Reading from a Clipboard (`pd.read_clipboard()`)**
Pandas can read data that has been copied to your system clipboard using the `pd.read_clipboard()` function. This is convenient for quick data sharing from spreadsheets or websites.

#### **Example: Reading from Clipboard**

```python
# Reading data from clipboard
df_clipboard = pd.read_clipboard()

# Display the first few rows
print(df_clipboard.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```

---

### 12. **Writing to a Clipboard (`df.to_clipboard()`)**
You can write DataFrame data to the clipboard using `to_clipboard()`. This is useful for quickly pasting data into other applications.

#### **Example: Writing to Clipboard**

```python
# Writing DataFrame to clipboard
df.to_clipboard(index=False)
```

**Output:**
The DataFrame data is now copied to your system clipboard and can be pasted into a spreadsheet or text editor.

---

### 13. **Reading Pickle Files (`pd.read_pickle()`)**
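Pandas can read a DataFrame from a pickle file using `pd.read_pickle()`. Pickle is a binary format specific to Python; it preserves data types and is typically faster to load than CSV for large datasets. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.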

#### **Example: Reading a Pickle File**

```python
# Reading a pickle file
df_pickle = pd.read_pickle('data.pkl')

# Display the first few rows
print(df_pickle.head())
```

**Output:**
```
   ID     Name  Age  Salary
0   1    Alice   25   70000
1   2      Bob   30   80000
2   3  Charlie   35  120000
```

---

### 14. **Writing to Pickle Files (`df.to_pickle()`)**
You can write DataFrame data to a pickle file using `to_pickle()`. This is useful for saving data quickly in a format that can be loaded back into Pandas.

#### **Example: Writing to a Pickle File**

```python
# Writing DataFrame to a pickle file
df.to_pickle('output.pkl')
```

**Output:**
A pickle file `output.pkl` is created with the DataFrame data.

---

### 15. **Reading and Writing Parquet Files**
Pandas supports the Parquet format, a columnar storage file format optimized for large-scale data processing. Both functions need a Parquet engine such as `pyarrow` or `fastparquet` to be installed.

#### **Example: Reading a Parquet File**

```python
# Reading a Parquet file
df_parquet = pd.read_parquet('data.parquet')

# Display the first few rows
print(df_parquet.head())
```

#### **Example: Writing to a Parquet File**

```python
# Writing DataFrame to Parquet
df.to_parquet('output.parquet')
```

---

### Summary of Key Methods for Reading and Writing Data:
- **CSV Files:** `pd.read_csv()`, `df.to_csv()`
- **Excel Files:** `pd.read_excel()`, `df.to_excel()`
- **SQL Databases:** `pd.read_sql()`, `df.to_sql()`
- **JSON Files:** `pd.read_json()`, `df.to_json()`
- **HTML Tables:** `pd.read_html()`, `df.to_html()`
- **Clipboard:** `pd.read_clipboard()`, `df.to_clipboard()`
- **Pickle Files:** `pd.read_pickle()`, `df.to_pickle()`
- **Parquet Files:** `pd.read_parquet()`, `df.to_parquet()`

---
---
---


### **Data Cleaning and Preprocessing**
* Data cleaning and preprocessing are essential steps in data analysis to ensure data quality and consistency.
* Pandas provides a rich set of tools to handle these tasks effectively.

---

### 1. **Handling Missing Data**

Missing data is common in real-world datasets. Pandas provides several methods to handle missing values effectively.

#### **1.1. Identifying Missing Data (`isna()`, `notna()`)**

- **`isna()`**: Returns a DataFrame of Boolean values, where `True` indicates a missing value.
- **`notna()`**: Returns the opposite of `isna()`, where `True` indicates a non-missing value.

#### **Example: Checking for Missing Data**

```python
import pandas as pd

# Sample data with missing values
data = {'Name': ['Alice', 'Bob', None],
        'Age': [25, None, 35],
        'Salary': [70000, 80000, None]}

df = pd.DataFrame(data)

# Identifying missing data
print(df.isna())
```

**Output:**
```
    Name    Age  Salary
0  False  False   False
1  False   True   False
2   True  False    True
```

---

#### **1.2. Dropping Missing Data (`dropna()`)**

- **`dropna()`**: Removes rows or columns with missing data. You can specify to drop rows (`axis=0`) or columns (`axis=1`).

#### **Example: Dropping Rows with Missing Data**

```python
# Dropping rows with missing values
df_dropped = df.dropna()

print(df_dropped)
```

**Output:**
```
    Name   Age   Salary
0  Alice  25.0  70000.0
```

#### **Example: Dropping Columns with Missing Data**

```python
# Dropping columns with missing values
# (every column in this DataFrame has at least one, so none survive)
df_dropped_col = df.dropna(axis=1)

print(df_dropped_col)
```

**Output:**
```
Empty DataFrame
Columns: []
Index: [0, 1, 2]
```
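
Dropping on "any missing value" is often too aggressive, as the empty result above shows. The `subset`, `thresh`, and `how` parameters give finer control. A minimal sketch that rebuilds the same sample data under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', None],
                       'Age': [25, None, 35],
                       'Salary': [70000, 80000, None]})

# Only consider the 'Age' column when deciding which rows to drop
age_known = people.dropna(subset=['Age'])
print(age_known.index.tolist())

# Keep rows that have at least 2 non-missing values
mostly_complete = people.dropna(thresh=2)
print(mostly_complete.index.tolist())

# how='all' drops a row only when every value in it is missing
not_empty = people.dropna(how='all')
print(len(not_empty))
```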

---

#### **1.3. Filling Missing Data (`fillna()`)**

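- **`fillna()`**: Fills missing values with a specified value. The related `ffill()` and `bfill()` methods propagate the previous or next valid value instead; the older `fillna(method=...)` spelling is deprecated since pandas 2.1.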

#### **Example: Filling Missing Data with a Specific Value**

```python
# Filling missing data with 0
df_filled = df.fillna(0)

print(df_filled)
```

**Output:**
```
    Name   Age   Salary
0  Alice  25.0  70000.0
1    Bob   0.0  80000.0
2      0  35.0      0.0
```
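
Filling every column with the same value rarely makes sense when the columns mean different things (note the `0` that landed in `Name` above). Passing a dict to `fillna()` gives each column its own fill value. A sketch using the same sample data under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', None],
                       'Age': [25, None, 35],
                       'Salary': [70000, 80000, None]})

# A dict maps each column to its own fill value
filled = people.fillna({
    'Name': 'Unknown',
    'Age': people['Age'].mean(),          # mean of 25 and 35
    'Salary': people['Salary'].median(),  # median of 70000 and 80000
})
print(filled)
```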

#### **Example: Forward Filling Missing Data**

```python
# Forward fill (propagate the last valid value forward)
df_ffill = df.ffill()

print(df_ffill)
```

**Output:**
```
    Name   Age   Salary
0  Alice  25.0  70000.0
1    Bob  25.0  80000.0
2    Bob  35.0  80000.0
```

---

#### **1.4. Interpolating Missing Data (`interpolate()`)**

- **`interpolate()`**: Fills missing values by interpolating between existing data points.

#### **Example: Interpolating Missing Data**

```python
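# Interpolate the numeric columns only; interpolation is not defined for text
df_interpolated = df.copy()
df_interpolated[['Age', 'Salary']] = df[['Age', 'Salary']].interpolate()

# A trailing gap (Salary in row 2) repeats the last known value
print(df_interpolated)
```

**Output:**
```
    Name   Age   Salary
0  Alice  25.0  70000.0
1    Bob  30.0  80000.0
2   None  35.0  80000.0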
```

---

### 2. **Handling Duplicates (`drop_duplicates()`)**

Duplicate rows can be a problem when processing data. Pandas provides the `drop_duplicates()` function to remove them.

#### **Example: Dropping Duplicate Rows**

```python
# Sample data with duplicates
data_dup = {'Name': ['Alice', 'Bob', 'Bob'],
            'Age': [25, 30, 30],
            'Salary': [70000, 80000, 80000]}

df_dup = pd.DataFrame(data_dup)

# Dropping duplicate rows
df_unique = df_dup.drop_duplicates()

print(df_unique)
```

**Output:**
```
    Name  Age  Salary
0  Alice   25   70000
1    Bob   30   80000
```
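
Rows are often duplicates only with respect to some columns. The `subset` and `keep` parameters, together with `duplicated()`, cover that case. A minimal sketch in which the two "Bob" rows differ in `Age` (values are illustrative):

```python
import pandas as pd

df_dup = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob'],
                       'Age': [25, 30, 31],
                       'Salary': [70000, 80000, 80000]})

# duplicated() flags repeats; here only 'Name' is compared
flags = df_dup.duplicated(subset=['Name']).tolist()
print(flags)

# keep='last' retains the final occurrence instead of the first
latest = df_dup.drop_duplicates(subset=['Name'], keep='last')
print(latest)
```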

---

### 3. **Data Transformation**

#### **3.1. Replacing Values (`replace()`)**

You can replace specific values in the DataFrame using `replace()`.

#### **Example: Replacing Specific Values**

```python
# Replace all occurrences of 70000 with 75000
df_replaced = df.replace(70000, 75000)

print(df_replaced)
```

**Output:**
```
    Name   Age   Salary
0  Alice  25.0  75000.0
1    Bob   NaN  80000.0
2   None  35.0      NaN
```

---

#### **3.2. Renaming Columns (`rename()`)**

You can rename columns using `rename()`.

#### **Example: Renaming a Column**

```python
# Rename the 'Salary' column to 'Income'
df_renamed = df.rename(columns={'Salary': 'Income'})

print(df_renamed)
```

**Output:**
```
    Name   Age   Income
0  Alice  25.0  70000.0
1    Bob   NaN  80000.0
2   None  35.0      NaN
```

---

### 4. **Changing Data Types (`astype()`)**

You may need to convert the data type of a column. The `astype()` function allows you to change the data type.

#### **Example: Changing the Data Type of a Column**

```python
# Convert the 'Age' column to a nullable integer type (plain int cannot hold missing values)
df_converted = df.astype({'Age': 'Int64'})

print(df_converted)
```

**Output:**
```
    Name   Age   Salary
0  Alice    25  70000.0
1    Bob  <NA>  80000.0
2   None    35      NaN
```

---

### 5. **Binning and Categorizing Data (`cut()`, `qcut()`)**

You can group continuous data into discrete intervals using `cut()` or `qcut()`.

#### **Example: Binning Data into Intervals**

```python
# Binning ages into categories
bins = [0, 18, 35, 60]
labels = ['Teen', 'Adult', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df)
```

**Output:**
```
    Name   Age   Salary Age_Group
0  Alice  25.0  70000.0     Adult
1    Bob   NaN  80000.0      NaN
2   None  35.0      NaN     Adult
```
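
The example above covers `cut()` only. The difference from `qcut()` is that `cut()` bins by value while `qcut()` bins by rank, so each quantile bin receives roughly the same number of rows. A short sketch with hypothetical salaries:

```python
import pandas as pd

salaries = pd.Series([40000, 50000, 60000, 70000, 80000, 90000])  # hypothetical values

# cut: bin edges are given as values
by_value = pd.cut(salaries, bins=[0, 50000, 100000], labels=['Low', 'High'])

# qcut: bin edges are chosen so that each bin holds about the same number of rows
by_quantile = pd.qcut(salaries, q=3, labels=['Low', 'Mid', 'High'])

print(by_value.value_counts().to_dict())
print(by_quantile.value_counts().to_dict())
```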

---

### 6. **Scaling and Normalizing Data**

Pandas can scale and normalize data directly, although libraries like `scikit-learn` are typically used for this in machine learning pipelines. Simple scaling needs only vectorized arithmetic on a column.

#### **Example: Scaling Salary by Dividing by 1000**

```python
# Scale Salary column by dividing by 1000
df['Salary_Scaled'] = df['Salary'] / 1000

print(df)
```

**Output:**
```
    Name   Age   Salary Age_Group  Salary_Scaled
0  Alice  25.0  70000.0     Adult         70.0
1    Bob   NaN  80000.0      NaN         80.0
2   None  35.0      NaN     Adult          NaN
```
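
Two other common rescalings, min-max normalization and z-score standardization, can be written with the same kind of column arithmetic. This is a sketch with illustrative values, not a replacement for the scalers in `scikit-learn`:

```python
import pandas as pd

salary = pd.Series([70000.0, 80000.0, 120000.0], name='Salary')  # illustrative values

# Min-max normalization rescales values to the range [0, 1]
min_max = (salary - salary.min()) / (salary.max() - salary.min())

# Z-score standardization centers on 0 with unit standard deviation
# (Series.std() uses the sample standard deviation, ddof=1, by default)
z_score = (salary - salary.mean()) / salary.std()

print(min_max.tolist())
print(z_score.round(3).tolist())
```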

---

### 7. **String Operations**

String data in a DataFrame can be manipulated using `str` accessor methods.

#### **Example: Converting to Lowercase**

```python
# Converting 'Name' column to lowercase
df['Name_Lower'] = df['Name'].str.lower()

print(df)
```

**Output:**
```
    Name   Age   Salary Age_Group  Salary_Scaled Name_Lower
0  Alice  25.0  70000.0     Adult         70.0      alice
1    Bob   NaN  80000.0      NaN         80.0        bob
2   None  35.0      NaN     Adult          NaN       None
```
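
The `str` accessor offers many more methods, and they can be chained. The sketch below cleans up a hypothetical column of messy names and shows how missing values are handled:

```python
import pandas as pd

names = pd.Series(['  Alice Smith', 'bob JONES ', None])  # hypothetical messy names

# Trim whitespace, then normalize capitalization
cleaned = names.str.strip().str.title()
print(cleaned.tolist())

# Missing values propagate through .str methods; na=False treats them as "no match"
has_smith = cleaned.str.contains('Smith', na=False)
print(has_smith.tolist())

# Split on the space and keep the first piece
first_names = cleaned.str.split(' ').str[0]
print(first_names.tolist())
```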

---

### 8. **Outlier Detection and Removal**

Detecting and removing outliers is important during preprocessing. One approach is to use interquartile ranges (IQR).

#### **Example: Detecting Outliers Using IQR**

```python
# Calculate the IQR for the Salary column
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Treat values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only the rows inside the bounds; rows with a missing Salary are dropped too
df_no_outliers = df[df['Salary'].between(lower, upper)]

print(df_no_outliers)
```

**Output:**
```
    Name   Age   Salary Age_Group  Salary_Scaled Name_Lower
0  Alice  25.0  70000.0     Adult         70.0      alice
1    Bob   NaN  80000.0      NaN         80.0        bob
```

---
---
---

### **Handling Missing Data**

* Handling missing data is a crucial aspect of data analysis, as missing values can lead to inaccurate results and insights.
* Pandas provides several methods to identify, handle, and fill missing data.

### 1. Identifying Missing Data
• **Definition**: Missing data can be identified using functions that check for null or NaN (Not a Number) values.
• **Methods**:
  - `isnull()`: Returns a DataFrame of the same shape as the original, with `True` for missing values and `False` for non-missing values.
  - `notnull()`: Returns the opposite of `isnull()`, indicating non-missing values.

**Example**:
```python
import pandas as pd
import numpy as np

# Sample DataFrame
data = {'A': [1, 2, np.nan], 'B': [4, np.nan, 6]}
df = pd.DataFrame(data)

# Identifying missing data
missing_data = df.isnull()
print(missing_data)
```
**Output**:
```
       A      B
0  False  False
1  False   True
2   True  False
```

### 2. Dropping Missing Data
• **Definition**: Removing rows or columns that contain missing values.
• **Methods**:
  - `dropna()`: Removes missing values based on specified criteria.
    - `axis=0`: Drop rows with missing values.
    - `axis=1`: Drop columns with missing values.
    - `how='any'`: Drop if any value is missing.
    - `how='all'`: Drop if all values are missing.

**Example**:
```python
# Dropping rows with any missing values
df_dropped_rows = df.dropna(axis=0, how='any')
print(df_dropped_rows)

# Dropping columns with any missing values
df_dropped_columns = df.dropna(axis=1, how='any')
print(df_dropped_columns)
```
**Output**:
```
     A    B
0  1.0  4.0
```
```
Empty DataFrame
Columns: []
Index: [0, 1, 2]
```

### 3. Filling Missing Data
• **Definition**: Replacing missing values with a specified value or method.
• **Methods**:
  - `fillna()`: Fill missing values with a specified value.
    - `value`: Fill with a specific value.
  - `ffill()`: Forward fill (replaces the deprecated `fillna(method='ffill')`).
  - `bfill()`: Backward fill (replaces the deprecated `fillna(method='bfill')`).

**Example**:
```python
# Filling missing values with a specific value
df_filled_value = df.fillna(0)
print(df_filled_value)

# Forward filling missing values (each gap takes the previous valid value)
df_filled_ffill = df.ffill()
print(df_filled_ffill)

# Backward filling missing values (each gap takes the next valid value)
df_filled_bfill = df.bfill()
print(df_filled_bfill)
```
**Output**:
```
     A    B
0  1.0  4.0
1  2.0  0.0
2  0.0  6.0
```
```
     A    B
0  1.0  4.0
1  2.0  4.0
2  2.0  6.0
```
```
     A    B
0  1.0  4.0
1  2.0  6.0
2  NaN  6.0
```

### 4. Interpolating Missing Data
• **Definition**: Estimating missing values based on other available data points.
• **Method**:
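  - `interpolate()`: Fills missing values using interpolation methods (linear, polynomial, etc.). With the default linear method, gaps between known values are filled along a straight line, and a trailing gap repeats the last known value rather than extrapolating.

**Example**:
```python
# Interpolating missing values
df_interpolated = df.interpolate()
print(df_interpolated)
```
**Output**:
```
     A    B
0  1.0  4.0
1  2.0  5.0
2  2.0  6.0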
```

### 5. Replacing Missing Data
• **Definition**: Replacing missing values with another value or method.
• **Method**:
  - `replace()`: Replace specified values with another value.

**Example**:
```python
# Replacing NaN with a specific value
df_replaced = df.replace(np.nan, -1)
print(df_replaced)
```
**Output**:
```
     A    B
0  1.0  4.0
1  2.0 -1.0
2 -1.0  6.0
```

### 6. Checking for Missing Data
• **Definition**: Summarizing the count of missing values in the DataFrame.
• **Method**:
  - `isnull().sum()`: Returns the count of missing values for each column.

**Example**:
```python
# Checking for missing data
missing_count = df.isnull().sum()
print(missing_count)
```
**Output**:
```
A    1
B    1
dtype: int64
```

### 7. Advanced Techniques
• **`isna()` and `notna()`**: Aliases of `isnull()` and `notnull()`; both spellings work on Series and DataFrame objects.
• **Custom Functions**: You can define custom functions to handle missing data based on specific business logic.
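
As one illustration of a custom rule, missing salaries could be filled with the mean of each person's department rather than a global value. The data, the `Dept` column, and the helper `fill_with_group_mean` are hypothetical and exist only for this sketch:

```python
import pandas as pd
import numpy as np

# Hypothetical data: salaries with a gap, grouped by department
staff = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT'],
    'Salary': [50000.0, np.nan, 90000.0, 70000.0],
})

def fill_with_group_mean(values):
    """Replace missing entries with the mean of the non-missing ones."""
    return values.fillna(values.mean())

# transform applies the function per group and keeps the original row order
staff['Salary'] = staff.groupby('Dept')['Salary'].transform(fill_with_group_mean)
print(staff)
```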

---
---
---



### **Filtering and Sorting Data**

* Filtering and sorting data in Pandas are essential operations for data analysis, allowing you to extract specific subsets of data and arrange them in a meaningful order.

### 1. Filtering Data
• **Definition**: Extracting rows from a DataFrame based on specific conditions.
• **Methods**:
  - Boolean indexing: Using boolean conditions to filter rows.
  - `query()`: A method that allows filtering using a query string.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000]
}
df = pd.DataFrame(data)

# Filtering using boolean indexing
filtered_age = df[df['Age'] > 25]
print(filtered_age)

# Filtering using query()
filtered_query = df.query('Salary > 55000')
print(filtered_query)
```
**Output**:
```
    Name  Age  Salary
1    Bob   30   60000
3  David   35   70000
```
```
    Name  Age  Salary
1    Bob   30   60000
3  David   35   70000
```

### 2. Filtering with Multiple Conditions
• **Definition**: Applying multiple conditions to filter data using logical operators.
• **Methods**:
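  - Use `&` (and), `|` (or), and `~` (not) to combine conditions. Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>`.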

**Example**:
```python
# Filtering with multiple conditions
filtered_multiple = df[(df['Age'] > 25) & (df['Salary'] > 55000)]
print(filtered_multiple)
```
**Output**:
```
    Name  Age  Salary
1    Bob   30   60000
3  David   35   70000
```
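
The other operators follow the same pattern, and `isin()` and `between()` shorten common compound conditions. A sketch that rebuilds this section's sample DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 30, 22, 35],
    'Salary': [50000, 60000, 45000, 70000],
})

# OR: either condition may hold
young_or_rich = df[(df['Age'] < 23) | (df['Salary'] > 65000)]

# NOT: '~' inverts a mask
not_bob = df[~(df['Name'] == 'Bob')]

# isin and between are shorthand for membership and range checks
selected = df[df['Name'].isin(['Alice', 'David']) & df['Age'].between(20, 30)]

print(young_or_rich['Name'].tolist())
print(not_bob['Name'].tolist())
print(selected['Name'].tolist())
```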

### 3. Sorting Data
• **Definition**: Arranging the rows of a DataFrame based on the values in one or more columns.
• **Methods**:
  - `sort_values()`: Sorts the DataFrame by specified column(s).
    - `by`: Column name(s) to sort by.
    - `ascending`: Boolean to specify ascending or descending order.

**Example**:
```python
# Sorting by a single column
sorted_by_age = df.sort_values(by='Age')
print(sorted_by_age)

# Sorting by multiple columns
sorted_by_multiple = df.sort_values(by=['Salary', 'Age'], ascending=[True, False])
print(sorted_by_multiple)
```
**Output**:
```
      Name  Age  Salary
2  Charlie   22   45000
0    Alice   24   50000
1      Bob   30   60000
3    David   35   70000
```
```
      Name  Age  Salary
2  Charlie   22   45000
0    Alice   24   50000
1      Bob   30   60000
3    David   35   70000
```
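
Two related tools are worth knowing: `nlargest()` returns the top rows without sorting the whole frame by hand, and the `key` parameter of `sort_values()` (available since pandas 1.1) transforms values before comparison. A sketch with illustrative data whose names are deliberately mixed-case:

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['alice', 'Bob', 'charlie', 'David'],
    'Salary': [50000, 60000, 45000, 70000],
})

# The two highest salaries, in descending order
top_two = staff.nlargest(2, 'Salary')
print(top_two['Name'].tolist())

# A plain sort puts uppercase letters before lowercase ones
plain = staff.sort_values(by='Name')
print(plain['Name'].tolist())

# key lowercases the values for comparison only; the data itself is unchanged
case_insensitive = staff.sort_values(by='Name', key=lambda s: s.str.lower())
print(case_insensitive['Name'].tolist())
```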

### 4. Sorting by Index
• **Definition**: Sorting the DataFrame based on its index.
• **Method**:
  - `sort_index()`: Sorts the DataFrame by its index.

**Example**:
```python
# Sample DataFrame with custom index
df_indexed = df.set_index('Name')

# Sorting by index
sorted_by_index = df_indexed.sort_index()
print(sorted_by_index)
```
**Output**:
```
          Age  Salary
Name                 
Alice      24   50000
Bob        30   60000
Charlie    22   45000
David      35   70000
```

### 5. Sorting with NaN Values
• **Definition**: Handling NaN values while sorting.
• **Method**:
  - `na_position`: Parameter of `sort_values()` that places NaN values at the beginning (`'first'`) or at the end (`'last'`, the default).

**Example**:
```python
# Sample DataFrame with NaN values
data_nan = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, None, 22, 35],
    'Salary': [50000, 60000, None, 70000]
}
df_nan = pd.DataFrame(data_nan)

# Sorting with NaN values at the end
sorted_nan = df_nan.sort_values(by='Age', na_position='last')
print(sorted_nan)
```
**Output**:
```
      Name   Age   Salary
2  Charlie  22.0      NaN
0    Alice  24.0  50000.0
3    David  35.0  70000.0
1      Bob   NaN  60000.0
```
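
For comparison, a short sketch of `na_position='first'`, and of the default behavior in a descending sort, where NaN rows still go last. The frame `df_nan_demo` is a hypothetical stand-in for the sample data:

```python
import pandas as pd

# Hypothetical frame with one missing Age
df_nan_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, None, 22, 35]
})

# NaN rows first, then ascending Age
nan_first = df_nan_demo.sort_values(by='Age', na_position='first')
print(nan_first)

# Descending sort: NaN rows still go last unless na_position says otherwise
desc_nan_last = df_nan_demo.sort_values(by='Age', ascending=False)
print(desc_nan_last)
```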

### 6. Resetting Index After Sorting
• **Definition**: Resetting the index of a DataFrame after sorting.
• **Method**:
  - `reset_index()`: Resets the index of the DataFrame.

**Example**:
```python
# Resetting index after sorting
sorted_reset_index = df.sort_values(by='Age').reset_index(drop=True)
print(sorted_reset_index)
```
**Output**:
```
      Name  Age  Salary
0  Charlie   22   45000
1    Alice   24   50000
2      Bob   30   60000
3    David   35   70000
```

----
----
----

### **Merging, Joining, and Concatenating Data**

* Merging, joining, and concatenating data in Pandas are essential operations for combining multiple DataFrames into a single DataFrame. Each method serves different purposes and is used in various scenarios.

### 1. Merging DataFrames
• **Definition**: Merging combines two DataFrames based on a common key or index, similar to SQL joins.
• **Method**:
  - `merge()`: Combines DataFrames based on specified columns or indices.
    - `how`: Type of merge to be performed (inner, outer, left, right).
    - `on`: Column(s) to join on.
    - `left_on` and `right_on`: Columns from the left and right DataFrames to join on.

**Example**:
```python
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'EmployeeID': [1, 2, 4],
    'Salary': [50000, 60000, 70000]
})

# Merging DataFrames
merged_inner = pd.merge(df1, df2, on='EmployeeID', how='inner')
print(merged_inner)

merged_outer = pd.merge(df1, df2, on='EmployeeID', how='outer')
print(merged_outer)
```
**Output**:
```
   EmployeeID     Name  Salary
0           1    Alice   50000
1           2      Bob   60000
```
```
   EmployeeID     Name   Salary
0           1    Alice   50000.0
1           2      Bob   60000.0
2           3  Charlie       NaN
3           4      NaN   70000.0
```
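
When the key columns have different names, `left_on` and `right_on` replace `on`. The sketch below also sets `indicator=True`, which adds a `_merge` column showing where each row came from. The frames `employees` and `payroll` and the column `StaffID` are made up for illustration:

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
payroll = pd.DataFrame({
    'StaffID': [1, 2, 4],
    'Salary': [50000, 60000, 70000]
})

# Outer merge on differently named keys; _merge reports the source of each row
merged_keys = pd.merge(
    employees, payroll,
    left_on='EmployeeID', right_on='StaffID',
    how='outer', indicator=True
)
print(merged_keys)

# Rows that exist only in the left frame
print(merged_keys[merged_keys['_merge'] == 'left_only'])
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.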

### 2. Joining DataFrames
• **Definition**: Joining is a method of combining DataFrames based on their indices.
• **Method**:
  - `join()`: Combines DataFrames using their indices.
    - `how`: Type of join to be performed (inner, outer, left, right).

**Example**:
```python
# Sample DataFrames with indices
df3 = pd.DataFrame({
    'Salary': [50000, 60000, 70000]},
    index=[1, 2, 3]
)

df4 = pd.DataFrame({
    'Department': ['HR', 'IT', 'Finance']},
    index=[1, 2, 4]
)

# Joining DataFrames
joined_inner = df3.join(df4, how='inner')
print(joined_inner)

joined_outer = df3.join(df4, how='outer')
print(joined_outer)
```
**Output**:
```
   Salary Department
1   50000         HR
2   60000         IT
```
```
    Salary Department
1  50000.0         HR
2  60000.0         IT
3  70000.0        NaN
4      NaN    Finance
```

### 3. Concatenating DataFrames
• **Definition**: Concatenation combines DataFrames along a particular axis (rows or columns).
• **Method**:
  - `concat()`: Combines DataFrames along a specified axis.
    - `axis`: 0 for rows, 1 for columns.
    - `ignore_index`: Boolean to reset the index.

**Example**:
```python
# Sample DataFrames
df5 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, 30]
})

df6 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [22, 35]
})

# Concatenating DataFrames vertically (along rows)
concatenated_rows = pd.concat([df5, df6], axis=0, ignore_index=True)
print(concatenated_rows)

# Concatenating DataFrames horizontally (along columns)
df7 = pd.DataFrame({
    'Salary': [50000, 60000]
})

concatenated_columns = pd.concat([df5, df7], axis=1)
print(concatenated_columns)
```
**Output**:
```
      Name  Age
0    Alice   24
1      Bob   30
2  Charlie   22
3    David   35
```
```
      Name  Age  Salary
0    Alice   24   50000
1      Bob   30   60000
```
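
If the origin of each row should stay visible after concatenation, the `keys` parameter of `concat()` builds an outer index level. A small sketch; the frame names and the labels `'first'` and `'second'` are arbitrary choices:

```python
import pandas as pd

# Hypothetical frames to be stacked vertically
batch_1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 30]})
batch_2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [22, 35]})

# keys adds an outer index level that records which frame each row came from
labelled = pd.concat([batch_1, batch_2], keys=['first', 'second'])
print(labelled)

# Select all rows that came from the second frame
print(labelled.loc['second'])
```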

### 4. Handling Duplicates in Concatenation
• **Definition**: Managing duplicate entries when concatenating DataFrames.
• **Method**:
  - `drop_duplicates()`: Removes duplicate rows from the concatenated DataFrame.

**Example**:
```python
# Sample DataFrames with duplicates
df8 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, 30]
})

df9 = pd.DataFrame({
    'Name': ['Alice', 'David'],
    'Age': [24, 35]
})

# Concatenating and removing duplicates
concatenated_unique = pd.concat([df8, df9]).drop_duplicates().reset_index(drop=True)
print(concatenated_unique)
```
**Output**:
```
      Name  Age
0    Alice   24
1      Bob   30
2    David   35
```

### 5. Concatenating with Different Columns
• **Definition**: Concatenating DataFrames with different columns.
• **Method**:
  - `concat()` will fill missing values with NaN for non-matching columns.

**Example**:
```python
# Sample DataFrames with different columns
df10 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, 30]
})

df11 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Salary': [45000, 70000]
})

# Concatenating DataFrames with different columns
concatenated_diff_columns = pd.concat([df10, df11], axis=0, ignore_index=True)
print(concatenated_diff_columns)
```
**Output**:
```
      Name   Age   Salary
0    Alice  24.0      NaN
1      Bob  30.0      NaN
2  Charlie   NaN  45000.0
3    David   NaN  70000.0
```

---
---
---

### **GroupBy Operations**

* GroupBy operations in Pandas are powerful tools for aggregating and summarizing data based on specific criteria. They allow you to split the data into groups, apply a function to each group, and combine the results back into a DataFrame.

### 1. Introduction to GroupBy
• **Definition**: The GroupBy operation involves splitting the data into groups based on some criteria, applying a function to each group, and then combining the results.
• **Method**:
  - `groupby()`: Used to group the DataFrame by one or more columns.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)

# Grouping by 'Category'
grouped = df.groupby('Category')
print(grouped)
```
**Output**:
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
```

### 2. Aggregating Data
• **Definition**: Applying aggregation functions to each group to summarize the data.
• **Common Aggregation Functions**:
  - `sum()`: Sum of values.
  - `mean()`: Average of values.
  - `count()`: Count of non-null values.
  - `min()`: Minimum value.
  - `max()`: Maximum value.

**Example**:
```python
# Aggregating data using sum
aggregated_sum = grouped.sum()
print(aggregated_sum)

# Aggregating data using mean
aggregated_mean = grouped.mean()
print(aggregated_mean)
```
**Output**:
```
          Values
Category        
A             90
B            120
```
```
          Values
Category        
A            30.0
B            40.0
```

### 3. Applying Multiple Aggregation Functions
• **Definition**: Applying multiple aggregation functions to the grouped data.
• **Method**:
  - `agg()`: Allows you to specify multiple aggregation functions.

**Example**:
```python
# Applying multiple aggregation functions
aggregated_multiple = grouped.agg(['sum', 'mean', 'count'])
print(aggregated_multiple)
```
**Output**:
```
          Values           
            sum  mean count
Category                  
A             90  30.0    3
B            120  40.0    3
```
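
To control the output column names, `agg()` also accepts named aggregation: each keyword becomes a result column and maps to a `(column, function)` pair. A minimal sketch on the same sample values; the result names `total`, `average`, and `spread` are illustrative:

```python
import pandas as pd

df_agg = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
})

# Named aggregation: keyword = (source column, aggregation)
summary = df_agg.groupby('Category').agg(
    total=('Values', 'sum'),
    average=('Values', 'mean'),
    spread=('Values', lambda s: s.max() - s.min())
)
print(summary)
```

This yields flat column names instead of the two-level header produced by passing a list of functions.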

### 4. Grouping by Multiple Columns
• **Definition**: Grouping data based on multiple columns.
• **Method**:
  - Pass a list of column names to `groupby()`.

**Example**:
```python
# Sample DataFrame with multiple grouping columns
data_multi = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Subcategory': ['X', 'Y', 'Y', 'X', 'X', 'Y'],
    'Values': [10, 20, 30, 40, 50, 60]
}
df_multi = pd.DataFrame(data_multi)

# Grouping by 'Category' and 'Subcategory'
grouped_multi = df_multi.groupby(['Category', 'Subcategory']).sum()
print(grouped_multi)
```
**Output**:
```
                   Values
Category Subcategory       
A        X            60
         Y            30
B        X            40
         Y            80
```

### 5. Filtering Groups
• **Definition**: Filtering groups based on a condition.
• **Method**:
  - `filter()`: Returns a DataFrame with groups that meet a specified condition.

**Example**:
```python
# Filtering groups where the sum of values is greater than 100
filtered_groups = grouped.filter(lambda x: x['Values'].sum() > 100)
print(filtered_groups)
```
**Output**:
```
  Category  Values
1        B      20
3        B      40
5        B      60
```

### 6. Transforming Data
• **Definition**: Applying a function to each group and returning a DataFrame with the same shape as the original.
• **Method**:
  - `transform()`: Used to perform operations that return a DataFrame with the same index as the original.

**Example**:
```python
# Transforming data to get the mean of each group
transformed = grouped.transform('mean')
print(transformed)
```
**Output**:
```
   Values
0    30.0
1    40.0
2    30.0
3    40.0
4    30.0
5    40.0
```
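
Because `transform()` keeps the original index, its result can be combined directly with the source columns. The sketch below computes each row's share of its group total and its deviation from the group mean; the column names `ShareOfCategory` and `DiffFromMean` are made up for illustration:

```python
import pandas as pd

df_share = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
})

# transform('sum') broadcasts each group's total back onto its rows
group_total = df_share.groupby('Category')['Values'].transform('sum')
df_share['ShareOfCategory'] = df_share['Values'] / group_total

# Deviation of each row from its own group mean
group_mean = df_share.groupby('Category')['Values'].transform('mean')
df_share['DiffFromMean'] = df_share['Values'] - group_mean
print(df_share)
```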

### 7. Custom Aggregation Functions
• **Definition**: Using custom functions for aggregation.
• **Method**:
  - Pass a custom function to `agg()`.

**Example**:
```python
# Custom aggregation function to calculate range
def range_func(x):
    return x.max() - x.min()

# Applying custom aggregation function
# (keyword-style named aggregation on a DataFrameGroupBy needs (column, function) tuples)
custom_agg = grouped.agg(range_func)
print(custom_agg)
```
**Output**:
```
          Values
Category        
A             40
B             40
```

### 8. GroupBy with Time Series Data
• **Definition**: Grouping time series data by time intervals.
• **Method**:
  - Use `Grouper` to specify the frequency.

**Example**:
```python
# Sample time series DataFrame
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df_time = pd.DataFrame(date_rng, columns=['date'])
df_time['data'] = pd.Series(range(1, len(df_time) + 1))

# Grouping into 2-day bins (the index must be a DatetimeIndex)
df_time.set_index('date', inplace=True)
grouped_time = df_time.groupby(pd.Grouper(freq='2D')).sum()
print(grouped_time)
```
**Output**:
```
            data
date             
2023-01-01    3
2023-01-03    7
2023-01-05   11
2023-01-07   15
2023-01-09   19
```
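
`Grouper` can also bin a regular datetime column via `key`, and it can be combined with other grouping columns. A sketch under assumed data; the `orders` frame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical order data with a datetime column instead of a datetime index
orders = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']),
    'store': ['N', 'S', 'N', 'S'],
    'amount': [10, 20, 30, 40]
})

# Group by store and by 2-day bins of the 'date' column
by_store_2d = orders.groupby(['store', pd.Grouper(key='date', freq='2D')])['amount'].sum()
print(by_store_2d)
```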
---
---
---

### **Aggregation and Descriptive Statistics**

Aggregation and descriptive statistics in Pandas are essential for summarizing and understanding datasets. They provide insights into the data's central tendency, dispersion, and overall distribution. Below are the key topics related to aggregation and descriptive statistics in Pandas, along with definitions, use cases, and examples.

### 1. Introduction to Aggregation
• **Definition**: Aggregation involves computing a summary statistic for a group of data points. It allows you to condense large datasets into meaningful metrics.
• **Common Aggregation Functions**:
  - `sum()`: Total sum of values.
  - `mean()`: Average of values.
  - `count()`: Number of non-null values.
  - `min()`: Minimum value.
  - `max()`: Maximum value.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)

# Aggregating data using sum
aggregated_sum = df.groupby('Category')['Values'].sum()
print(aggregated_sum)
```
**Output**:
```
Category
A     90
B    120
Name: Values, dtype: int64
```

### 2. Descriptive Statistics
• **Definition**: Descriptive statistics provide a summary of the main characteristics of a dataset, including measures of central tendency and variability.
• **Method**:
  - `describe()`: Generates descriptive statistics for numerical columns.

**Example**:
```python
# Generating descriptive statistics
descriptive_stats = df['Values'].describe()
print(descriptive_stats)
```
**Output**:
```
count     6.000000
mean     35.000000
std      18.708287
min      10.000000
25%      22.500000
50%      35.000000
75%      47.500000
max      60.000000
Name: Values, dtype: float64
```
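
The individual figures reported by `describe()` can be reproduced with `std()` and `quantile()`, which shows how they are defined: the standard deviation is the sample version (`ddof=1`), and quantiles use linear interpolation by default. A short sketch; the variable names are illustrative:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 60], name='Values')

# Sample standard deviation (ddof=1), as reported by describe()
sample_std = values.std()
print(sample_std)

# Quartiles: linear interpolation between neighbouring data points
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)

# describe() accepts custom percentiles; the median is always included
custom = values.describe(percentiles=[0.1, 0.9])
print(custom)
```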

### 3. Applying Multiple Aggregation Functions
• **Definition**: You can apply multiple aggregation functions to summarize data in various ways.
• **Method**:
  - `agg()`: Allows you to specify multiple aggregation functions.

**Example**:
```python
# Applying multiple aggregation functions
aggregated_multiple = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])
print(aggregated_multiple)
```
**Output**:
```
          sum  mean  count
Category                  
A          90  30.0      3
B         120  40.0      3
```

### 4. Custom Aggregation Functions
• **Definition**: You can define and apply custom functions for aggregation.
• **Method**:
  - Pass a custom function to `agg()`.

**Example**:
```python
# Custom aggregation function to calculate range
def range_func(x):
    return x.max() - x.min()

# Applying custom aggregation function
custom_agg = df.groupby('Category')['Values'].agg(range=range_func)
print(custom_agg)
```
**Output**:
```
          range
Category       
A            40
B            40
```

### 5. Descriptive Statistics for Categorical Data
• **Definition**: Descriptive statistics can also be applied to categorical data to summarize counts and unique values.
• **Method**:
  - `value_counts()`: Returns the counts of unique values in a Series.

**Example**:
```python
# Sample DataFrame with categorical data
data_cat = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Subcategory': ['X', 'Y', 'Y', 'X', 'X', 'Y']
}
df_cat = pd.DataFrame(data_cat)

# Descriptive statistics for categorical data
category_counts = df_cat['Category'].value_counts()
print(category_counts)
```
**Output**:
```
A    3
B    3
Name: Category, dtype: int64
```

### 6. Correlation and Covariance
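• **Definition**: Covariance measures how two variables vary together; its sign gives the direction of the relationship, but its magnitude depends on the variables' units. Correlation is covariance rescaled to the range -1 to 1, so it describes both the direction and the strength of a linear relationship.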
• **Methods**:
  - `corr()`: Computes pairwise correlation of columns.
  - `cov()`: Computes pairwise covariance of columns.

**Example**:
```python
# Sample DataFrame for correlation and covariance
data_corr = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}
df_corr = pd.DataFrame(data_corr)

# Calculating correlation
correlation = df_corr.corr()
print(correlation)

# Calculating covariance
covariance = df_corr.cov()
print(covariance)
```
**Output (Correlation)**:
```
     A    B    C
A  1.0 -1.0  1.0
B -1.0  1.0 -1.0
C  1.0 -1.0  1.0
```
**Output (Covariance)**:
```
     A    B    C
A  2.5 -2.5  2.5
B -2.5  2.5 -2.5
C  2.5 -2.5  2.5
```
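
The link between the two measures can be checked by hand: the Pearson correlation equals the covariance divided by the product of the two standard deviations. The sketch below also rescales one column to show that covariance changes with units while correlation does not; all variable names are illustrative:

```python
import pandas as pd

df_rel = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

# Correlation = covariance / (std of A * std of B)
cov_ab = df_rel['A'].cov(df_rel['B'])
corr_manual = cov_ab / (df_rel['A'].std() * df_rel['B'].std())
corr_builtin = df_rel['A'].corr(df_rel['B'])
print(cov_ab, corr_manual, corr_builtin)

# Multiplying A by 10 scales the covariance by 10 but leaves the correlation unchanged
scaled_a = df_rel['A'] * 10
cov_scaled = scaled_a.cov(df_rel['B'])
corr_scaled = scaled_a.corr(df_rel['B'])
print(cov_scaled, corr_scaled)
```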

### 7. Grouping and Aggregating with Descriptive Statistics
• **Definition**: You can combine grouping and descriptive statistics to summarize data based on categories.
• **Method**:
  - Use `groupby()` followed by `describe()`.

**Example**:
```python
# Grouping and generating descriptive statistics
grouped_descriptive = df.groupby('Category')['Values'].describe()
print(grouped_descriptive)
```
**Output**:
```
          count  mean       std   min   25%   50%   75%   max
Category                                                      
A          3.0  30.0  20.000000  10.0  20.0  30.0  40.0  50.0
B          3.0  40.0  20.000000  20.0  30.0  40.0  50.0  60.0
```

---
---
---

### **Pivot Tables and Cross Validation**

* Pivot tables and cross-validation are important concepts in data analysis and machine learning. Pivot tables are built into Pandas and let you summarize and reorganize data. Cross-validation, a technique for assessing how well a machine learning model generalizes, is not part of Pandas; the examples below use scikit-learn on Pandas DataFrames.

### 1. Pivot Tables in Pandas
• **Definition**: A pivot table is a data processing tool that allows you to summarize and reorganize data in a DataFrame. It enables you to aggregate data based on one or more keys and display the results in a tabular format.
• **Method**:
  - `pivot_table()`: Creates a pivot table from a DataFrame.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
}
df = pd.DataFrame(data)

# Creating a pivot table
pivot_table = df.pivot_table(values='Sales', index='Date', columns='Category', aggfunc='sum', fill_value=0)
print(pivot_table)
```
**Output**:
```
Category      A    B
Date                
2023-01-01  100  200
2023-01-02  150  250
```
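
Row and column totals can be added with `margins=True`. A minimal sketch on the same sales figures; the frame name `df_sales` and the label `'Total'` are illustrative choices:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
})

# margins=True appends a total row and a total column using the same aggfunc
pivot_totals = df_sales.pivot_table(
    values='Sales', index='Date', columns='Category',
    aggfunc='sum', margins=True, margins_name='Total'
)
print(pivot_totals)
```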

### 2. Pivot Table with Multiple Aggregation Functions
• **Definition**: You can apply multiple aggregation functions to summarize data in a pivot table.
• **Method**:
  - Use the `aggfunc` parameter to specify a list of functions.

**Example**:
```python
# Creating a pivot table with multiple aggregation functions
pivot_table_multi = df.pivot_table(values='Sales', index='Date', columns='Category', aggfunc=['sum', 'mean'], fill_value=0)
print(pivot_table_multi)
```
**Output**:
```
            sum        mean       
Category      A    B      A      B
Date                              
2023-01-01  100  200  100.0  200.0
2023-01-02  150  250  150.0  250.0
```

### 3. Cross-Validation with Pandas DataFrames
• **Definition**: Cross-validation is a technique used to evaluate the performance of a machine learning model by partitioning the data into subsets. The model is trained on some subsets and tested on others to ensure it generalizes well to unseen data.
• **Method**:
  - Use `KFold` or `StratifiedKFold` from `sklearn.model_selection` to create cross-validation splits.

**Example**:
```python
from sklearn.model_selection import KFold
import numpy as np

# Sample data
data = {
    'Feature1': [1, 2, 3, 4, 5, 6],
    'Feature2': [10, 20, 30, 40, 50, 60],
    'Target': [0, 1, 0, 1, 0, 1]
}
df_cv = pd.DataFrame(data)

# Defining features and target
X = df_cv[['Feature1', 'Feature2']]
y = df_cv['Target']

# Setting up KFold cross-validation
kf = KFold(n_splits=3)

# Performing cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
```
**Output**:
```
TRAIN: [2 3 4 5] TEST: [0 1]
TRAIN: [0 1 4 5] TEST: [2 3]
TRAIN: [0 1 2 3] TEST: [4 5]
```

### 4. Stratified Cross-Validation
• **Definition**: Stratified cross-validation ensures that each fold has the same proportion of classes as the entire dataset, which is particularly useful for imbalanced datasets.
• **Method**:
  - Use `StratifiedKFold` from `sklearn.model_selection`.

**Example**:
```python
from sklearn.model_selection import StratifiedKFold

# Setting up StratifiedKFold cross-validation
skf = StratifiedKFold(n_splits=3)

# Performing stratified cross-validation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
```
**Output**:
```
TRAIN: [2 3 4 5] TEST: [0 1]
TRAIN: [0 1 4 5] TEST: [2 3]
TRAIN: [0 1 2 3] TEST: [4 5]
```

### 5. Evaluating Model Performance
• **Definition**: After performing cross-validation, you can evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.
• **Method**:
  - Use metrics from `sklearn.metrics`.

**Example**:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize model
model = LogisticRegression()

# Cross-validation evaluation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
```
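**Output** (the toy target alternates between 0 and 1 and is not linearly separable, so not every fold scores perfectly; exact values can depend on the scikit-learn version):
```
Accuracy: 0.5
Accuracy: 1.0
Accuracy: 0.5
```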

---
---
---

### **Reshaping Data (Melt, Stack, Unstack)**

* Reshaping data in Pandas is essential for transforming data into a format that is more suitable for analysis or visualization. The `melt`, `stack`, and `unstack` functions are powerful tools for reshaping DataFrames.

### 1. Melt
• **Definition**: The `melt` function is used to transform a DataFrame from a wide format to a long format. It unpivots the DataFrame, turning columns into rows.
• **Method**:
  - `melt()`: Takes a DataFrame and returns a new DataFrame in long format.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Date': ['2023-01-01', '2023-01-02'],
    'Sales_A': [100, 150],
    'Sales_B': [200, 250]
}
df = pd.DataFrame(data)

# Melting the DataFrame
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Sales_A', 'Sales_B'],
                    var_name='Category', value_name='Sales')
print(melted_df)
```
**Output**:
```
         Date Category  Sales
0  2023-01-01  Sales_A    100
1  2023-01-02  Sales_A    150
2  2023-01-01  Sales_B    200
3  2023-01-02  Sales_B    250
```

### 2. Stack
• **Definition**: The `stack` function is used to pivot the columns of a DataFrame into the index, effectively converting a DataFrame from wide to long format. It stacks the columns into a single column.
• **Method**:
  - `stack()`: Stacks the columns of a DataFrame into a Series.

**Example**:
```python
# Sample DataFrame
data_stack = {
    'Date': ['2023-01-01', '2023-01-02'],
    'Sales_A': [100, 150],
    'Sales_B': [200, 250]
}
df_stack = pd.DataFrame(data_stack).set_index('Date')

# Stacking the DataFrame
stacked_df = df_stack.stack()
print(stacked_df)
```
**Output**:
```
Date          
2023-01-01  Sales_A    100
            Sales_B    200
2023-01-02  Sales_A    150
            Sales_B    250
dtype: int64
```

### 3. Unstack
• **Definition**: The `unstack` function is the inverse of `stack`. It pivots the innermost level of the index (or a specified level) into columns, converting a long format DataFrame back to a wide format.
• **Method**:
  - `unstack()`: Converts the innermost index level to columns.

**Example**:
```python
# Unstacking the stacked DataFrame
unstacked_df = stacked_df.unstack()
print(unstacked_df)
```
**Output**:
```
            Sales_A  Sales_B
Date                        
2023-01-01      100      200
2023-01-02      150      250
```

### 4. Reshaping with MultiIndex
• **Definition**: You can create a MultiIndex DataFrame and use `stack` and `unstack` to reshape data with multiple levels of indexing.
• **Example**:
```python
# Sample DataFrame with MultiIndex
data_multi = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
}
df_multi = pd.DataFrame(data_multi)

# Setting MultiIndex
df_multi.set_index(['Date', 'Category'], inplace=True)

# Stacking and unstacking with MultiIndex
stacked_multi = df_multi.stack()
print(stacked_multi)

unstacked_multi = stacked_multi.unstack()
print(unstacked_multi)
```
**Output (Stacked)**:
```
Date        Category       
2023-01-01  A         Sales    100
            B         Sales    200
2023-01-02  A         Sales    150
            B         Sales    250
dtype: int64
```
**Output (Unstacked)**:
```
                     Sales
Date       Category       
2023-01-01 A           100
           B           200
2023-01-02 A           150
           B           250
```

### 5. Using `pivot` for Reshaping
• **Definition**: The `pivot` function is another way to reshape data, similar to `pivot_table`, but it does not aggregate: each index/column pair must be unique, otherwise `pivot` raises a `ValueError`.
• **Method**:
  - `pivot()`: Reshapes data based on unique values from specified columns.

**Example**:
```python
# Sample DataFrame
data_pivot = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
}
df_pivot = pd.DataFrame(data_pivot)

# Pivoting the DataFrame
pivoted_df = df_pivot.pivot(index='Date', columns='Category', values='Sales')
print(pivoted_df)
```
**Output**:
```
Category      A    B
Date                
2023-01-01  100  200
2023-01-02  150  250
```
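
The difference from `pivot_table` shows up as soon as an index/column pair occurs more than once. In the sketch below, `pivot()` fails on the duplicated pair, while `pivot_table()` aggregates it. The frame `df_dupes` is a made-up example:

```python
import pandas as pd

# Hypothetical data where ('2023-01-01', 'A') appears twice
df_dupes = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01'],
    'Category': ['A', 'A', 'B'],
    'Sales': [100, 50, 200]
})

# pivot() cannot decide which of the duplicate values to keep
try:
    df_dupes.pivot(index='Date', columns='Category', values='Sales')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print('pivot() failed:', exc)

# pivot_table() resolves duplicates with an aggregation function
aggregated = df_dupes.pivot_table(index='Date', columns='Category', values='Sales', aggfunc='sum')
print(aggregated)
```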

---
---
---

### **Time Series Data**

* Time series data is a sequence of data points indexed in time order, often used for analyzing trends, seasonal patterns, and forecasting.
* Pandas provides powerful tools for working with time series data, making it easy to manipulate, analyze, and visualize temporal data.

### 1. Creating Time Series Data
• **Definition**: You can create a time series DataFrame by using a date range or by converting a column to datetime format.
• **Method**:
  - `pd.date_range()`: Generates a range of dates.
  - `pd.to_datetime()`: Converts a column to datetime format.

**Example**:
```python
import pandas as pd

# Creating a date range
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')

# Creating a DataFrame with time series data
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = range(1, len(df) + 1)
df.set_index('date', inplace=True)
print(df)
```
**Output**:
```
            data
date             
2023-01-01     1
2023-01-02     2
2023-01-03     3
2023-01-04     4
2023-01-05     5
2023-01-06     6
2023-01-07     7
2023-01-08     8
2023-01-09     9
2023-01-10    10
```

### 2. Indexing and Selecting Time Series Data
• **Definition**: You can index and select data based on date ranges or specific timestamps.
• **Method**:
  - Use the DataFrame index to filter data.

**Example**:
```python
# Selecting data for a specific date
selected_data = df.loc['2023-01-05']
print(selected_data)

# Selecting data for a date range (both endpoints are included)
range_data = df.loc['2023-01-03':'2023-01-07']
print(range_data)
```
**Output**:
```
data    5
Name: 2023-01-05 00:00:00, dtype: int64
```
```
            data
date             
2023-01-03     3
2023-01-04     4
2023-01-05     5
2023-01-06     6
2023-01-07     7
```

### 3. Resampling Time Series Data
• **Definition**: Resampling is the process of changing the frequency of your time series data, either by upsampling (increasing frequency) or downsampling (decreasing frequency).
• **Method**:
  - `resample()`: Used to change the frequency of the time series data.

**Example**:
```python
# Resampling to a different frequency (e.g., weekly)
weekly_data = df.resample('W').sum()
print(weekly_data)
```
**Output**:
```
            data
date             
2023-01-01     1
2023-01-08    35
2023-01-15    19
```
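
The weekly bins close on Sundays, and 2023-01-01 is a Sunday, so the first bin holds a single day. The sketch below rebuilds the same series, adds a 3-day downsample, and shows upsampling with forward fill. The names `ts`, `every_other_day`, and `daily_filled` are illustrative:

```python
import pandas as pd

idx = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
ts = pd.DataFrame({'data': list(range(1, len(idx) + 1))}, index=idx)

# Downsampling: 'W' bins end on Sunday, so 2023-01-01 forms a bin on its own
weekly = ts.resample('W').sum()
print(weekly)

# Downsampling into 3-day bins that start at the first timestamp
three_day = ts.resample('3D').sum()
print(three_day)

# Upsampling: keep every other day, rebuild a daily index,
# and forward-fill the rows that the new index introduces
every_other_day = ts.iloc[::2]
daily_filled = every_other_day.resample('D').ffill()
print(daily_filled)
```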

### 4. Time Series Operations
• **Definition**: You can perform various operations on time series data, such as shifting, rolling windows, and calculating differences.
• **Methods**:
  - `shift()`: Shifts the data by a specified number of periods.
  - `rolling()`: Provides rolling window calculations.
  - `diff()`: Computes the difference between consecutive periods.

**Example**:
```python
# Shifting data
shifted_data = df.shift(1)
print(shifted_data)

# Rolling window calculation (e.g., 3-day moving average)
rolling_avg = df.rolling(window=3).mean()
print(rolling_avg)
```
**Output (Shifted)**:
```
            data
date             
2023-01-01   NaN
2023-01-02   1.0
2023-01-03   2.0
2023-01-04   3.0
2023-01-05   4.0
2023-01-06   5.0
2023-01-07   6.0
2023-01-08   7.0
2023-01-09   8.0
2023-01-10   9.0
```
```
            data
date             
2023-01-01   NaN
2023-01-02   NaN
2023-01-03   2.0
2023-01-04   3.0
2023-01-05   4.0
2023-01-06   5.0
2023-01-07   6.0
2023-01-08   7.0
2023-01-09   8.0
2023-01-10   9.0
```
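
Differences between consecutive periods can be computed with `diff()`, which is equivalent to subtracting a shifted copy of the series; `pct_change()` gives the relative change. A short sketch with made-up prices:

```python
import pandas as pd

# Hypothetical daily prices
idx = pd.date_range('2023-01-01', periods=5, freq='D')
prices = pd.Series([100, 110, 99, 99, 120], index=idx, name='price')

# Absolute change from the previous day (first value has no predecessor -> NaN)
daily_change = prices.diff()
print(daily_change)

# The same result written out with shift()
manual_change = prices - prices.shift(1)
print(manual_change)

# Relative change from the previous day
pct = prices.pct_change()
print(pct)
```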

### 5. Time Series Visualization
• **Definition**: Visualizing time series data helps in understanding trends and patterns over time.
• **Method**:
  - Use Matplotlib or Pandas built-in plotting functions.

**Example**:
```python
import matplotlib.pyplot as plt

# Plotting the time series data
df.plot(figsize=(10, 5))
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Data')
plt.grid()
plt.show()
```

### 6. Handling Missing Data in Time Series
• **Definition**: Time series data often contains missing values, which can be handled using various methods.
• **Methods**:
  - `fillna()`: Fill missing values.
  - `interpolate()`: Interpolate missing values.

**Example**:
```python
# Introducing missing values
df_missing = df.copy()
df_missing.loc['2023-01-03'] = None

# Filling missing values by carrying the last valid observation forward
# (fillna(method='ffill') is deprecated in recent pandas versions)
filled_data = df_missing.ffill()
print(filled_data)

# Interpolating missing values
interpolated_data = df_missing.interpolate()
print(interpolated_data)
```
**Output (Filled)**:
```
            data
date             
2023-01-01   1.0
2023-01-02   2.0
2023-01-03   2.0
2023-01-04   4.0
2023-01-05   5.0
2023-01-06   6.0
2023-01-07   7.0
2023-01-08   8.0
2023-01-09   9.0
2023-01-10  10.0
```
```
            data
date             
2023-01-01   1.0
2023-01-02   2.0
2023-01-03   3.0
2023-01-04   4.0
2023-01-05   5.0
2023-01-06   6.0
2023-01-07   7.0
2023-01-08   8.0
2023-01-09   9.0
2023-01-10  10.0
```
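
By default, `interpolate()` treats the rows as equally spaced. When timestamps are irregular, `method='time'` weights the estimate by elapsed time instead. A minimal sketch with made-up readings:

```python
import pandas as pd

# Hypothetical readings with an uneven gap: 1 day, then 3 days
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
readings = pd.Series([10.0, None, 50.0], index=idx)

# Default linear method: ignores the index and takes the midpoint
linear = readings.interpolate()
print(linear)

# Time-aware method: 2023-01-02 lies one quarter of the way from the 1st to the 5th
by_time = readings.interpolate(method='time')
print(by_time)
```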

### 7. Time Zone Handling
• **Definition**: Time series data can include time zone information, which is important for accurate analysis.
• **Method**:
  - `tz_localize()`: Localizes naive datetime to a specific time zone.
  - `tz_convert()`: Converts time zone-aware datetime to another time zone.

**Example**:
```python
# Localizing to a specific time zone
df_tz = df.tz_localize('UTC')
print(df_tz)

# Converting to another time zone
df_tz_converted = df_tz.tz_convert('America/New_York')
print(df_tz_converted)
```
**Output (Localized)**:
```
                     data
date                     
2023-01-01 00:00:00+00:00   1
2023-01-02 00:00:00+00:00   2
2023-01-03 00:00:00+00:00   3
2023-01-04 00:00:00+00:00   4
2023-01-05 00:00:00+00:00   5
2023-01-06 00:00:00+00:00   6
2023-01-07 00:00:00+00:00   7
2023-01-08 00:00:00+00:00   8
2023-01-09 00:00:00+00:00   9
2023-01-10 00:00:00+00:00  10
```
```
                     data
date                     
2022-12-31 19:00:00-05:00   1
2023-01-01 19:00:00-05:00   2
2023-01-02 19:00:00-05:00   3
2023-01-03 19:00:00-05:00   4
2023-01-04 19:00:00-05:00   5
2023-01-05 19:00:00-05:00   6
2023-01-06 19:00:00-05:00   7
2023-01-07 19:00:00-05:00   8
2023-01-08 19:00:00-05:00   9
2023-01-09 19:00:00-05:00  10
```

---
---
---

### **Working with Dates and Times**

* Working with dates and times in Pandas is essential for time series analysis and data manipulation.
* Pandas provides a variety of functions and methods to handle date and time data effectively.

### 1. Creating Date and Time Objects
• **Definition**: You can create date and time objects using `pd.to_datetime()` or by generating a date range with `pd.date_range()`.
• **Method**:
  - `pd.to_datetime()`: Converts a string or a list of strings to datetime objects.
  - `pd.date_range()`: Generates a range of dates.

**Example**:
```python
import pandas as pd

# Creating a single datetime object
date_single = pd.to_datetime('2023-01-01')
print(date_single)

# Creating a range of dates
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
print(date_rng)
```
**Output**:
```
2023-01-01 00:00:00
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')
```

### 2. Converting Strings to Datetime
• **Definition**: You can convert strings in various formats to datetime objects using `pd.to_datetime()`.
• **Method**:
  - `pd.to_datetime()`: Infers the format automatically, or accepts an explicit `format` string.

**Example**:
```python
# Converting a string to datetime
date_str = '2023-01-01 12:30:45'
date_converted = pd.to_datetime(date_str)
print(date_converted)

# Converting a list of strings to datetime
date_list = ['2023-01-01', '2023-01-02', '2023-01-03']
date_converted_list = pd.to_datetime(date_list)
print(date_converted_list)
```
**Output**:
```
2023-01-01 12:30:45
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'], dtype='datetime64[ns]', freq=None)
```
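
When the strings are not in ISO order, or some of them may be malformed, it can help to state the format explicitly. The sketch below assumes day-first dates and uses `errors='coerce'` to turn unparseable entries into `NaT`; the sample values and the names `raw_dates` and `parsed` are made up for illustration.

```python
import pandas as pd

# Day-first strings, plus one entry that is not a date at all
raw_dates = pd.Series(['01/02/2023', '15/02/2023', 'not a date'])

# An explicit format removes day/month ambiguity;
# errors='coerce' yields NaT for entries that do not match instead of raising
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y', errors='coerce')
print(parsed)
```

The first entry is read as 1 February 2023 rather than 2 January, and the last entry becomes `NaT`.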

### 3. Extracting Date and Time Components
• **Definition**: You can extract specific components (year, month, day, hour, minute, second) from datetime objects.
• **Method**:
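  - Use the `.dt` accessor on a datetime Series (for example, `df['date'].dt.year`); a `DatetimeIndex`, as in the example below, exposes the same properties directly.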

**Example**:
```python
# Sample DataFrame with datetime index
date_rng = pd.date_range(start='2023-01-01', end='2023-01-05', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = range(1, len(df) + 1)
df.set_index('date', inplace=True)

# Extracting components
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df['hour'] = df.index.hour
print(df)
```
**Output**:
```
            data  year  month  day  hour
date                                     
2023-01-01     1  2023      1    1     0
2023-01-02     2  2023      1    2     0
2023-01-03     3  2023      1    3     0
2023-01-04     4  2023      1    4     0
2023-01-05     5  2023      1    5     0
```
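
When the datetimes live in a regular column rather than in the index, the same components are reached through the `.dt` accessor. This is a minimal sketch; the `events` DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Datetimes stored in an ordinary column (a Series), not in the index
events = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-01 08:15:00', '2023-03-15 17:45:00'])
})

# The .dt accessor exposes datetime properties and methods on a Series
events['year'] = events['timestamp'].dt.year
events['month'] = events['timestamp'].dt.month
events['hour'] = events['timestamp'].dt.hour
events['weekday'] = events['timestamp'].dt.day_name()
print(events)
```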

### 4. Date Arithmetic
• **Definition**: You can perform arithmetic operations on datetime objects, such as adding or subtracting time intervals.
• **Method**:
  - Use `pd.Timedelta` to represent time durations.

**Example**:
```python
# Adding days to a datetime
df['date_plus_3_days'] = df.index + pd.Timedelta(days=3)
print(df)

# Subtracting days from a datetime
df['date_minus_2_days'] = df.index - pd.Timedelta(days=2)
print(df)
```
**Output (Adding Days)**:
```
            data  year  month  day  hour date_plus_3_days
date                                                    
2023-01-01     1  2023      1    1     0       2023-01-04
2023-01-02     2  2023      1    2     0       2023-01-05
2023-01-03     3  2023      1    3     0       2023-01-06
2023-01-04     4  2023      1    4     0       2023-01-07
2023-01-05     5  2023      1    5     0       2023-01-08
```
**Output (Subtracting Days)**:
```
            data  year  month  day  hour date_plus_3_days date_minus_2_days
date                                                                             
2023-01-01     1  2023      1    1     0       2023-01-04       2022-12-30
2023-01-02     2  2023      1    2     0       2023-01-05       2022-12-31
2023-01-03     3  2023      1    3     0       2023-01-06       2023-01-01
2023-01-04     4  2023      1    4     0       2023-01-07       2023-01-02
2023-01-05     5  2023      1    5     0       2023-01-08       2023-01-03
```

### 5. Time Zone Handling
• **Definition**: You can localize naive datetime objects to a specific time zone and convert between time zones.
• **Method**:
  - `tz_localize()`: Localizes naive datetime to a specific time zone.
  - `tz_convert()`: Converts time zone-aware datetime to another time zone.

**Example**:
```python
# Localizing to a specific time zone
df_tz = df.tz_localize('UTC')
print(df_tz)

# Converting to another time zone
df_tz_converted = df_tz.tz_convert('America/New_York')
print(df_tz_converted)
```
**Output (Localized)**:
```
                     data  year  month  day  hour date_plus_3_days date_minus_2_days
date                                                                             
2023-01-01 00:00:00+00:00     1  2023      1    1     0       2023-01-04       2022-12-30
2023-01-02 00:00:00+00:00     2  2023      1    2     0       2023-01-05       2022-12-31
2023-01-03 00:00:00+00:00     3  2023      1    3     0       2023-01-06       2023-01-01
2023-01-04 00:00:00+00:00     4  2023      1    4     0       2023-01-07       2023-01-02
2023-01-05 00:00:00+00:00     5  2023      1    5     0       2023-01-08       2023-01-03
```
**Output (Converted)**:
```
                     data  year  month  day  hour date_plus_3_days date_minus_2_days
date                                                                             
2022-12-31 19:00:00-05:00     1  2023      1    1     0       2023-01-04       2022-12-30
2023-01-01 19:00:00-05:00     2  2023      1    2     0       2023-01-05       2022-12-31
2023-01-02 19:00:00-05:00     3  2023      1    3     0       2023-01-06       2023-01-01
2023-01-03 19:00:00-05:00     4  2023      1    4     0       2023-01-07       2023-01-02
2023-01-04 19:00:00-05:00     5  2023      1    5     0       2023-01-08       2023-01-03
```

### 6. Working with Timedelta
• **Definition**: Timedelta represents the difference between two dates or times.
• **Method**:
  - `pd.Timedelta()`: Represents a duration of time.

**Example**:
```python
# Creating a Timedelta
delta = pd.Timedelta(days=5, hours=3)
print(delta)

# Adding Timedelta to a datetime
new_date = date_single + delta
print(new_date)
```
**Output**:
```
5 days 03:00:00
2023-01-06 03:00:00
```
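
A Timedelta also arises naturally when one timestamp is subtracted from another. The short sketch below uses two made-up timestamps (`start` and `end`) to show the resulting duration and two ways to read it.

```python
import pandas as pd

start = pd.Timestamp('2023-01-01 09:00')
end = pd.Timestamp('2023-01-06 12:00')

# Subtracting two timestamps yields a Timedelta
elapsed = end - start
print(elapsed)                  # 5 days 03:00:00
print(elapsed.days)             # 5
print(elapsed.total_seconds())  # 442800.0
```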

### 7. Date Range Generation
• **Definition**: You can generate a range of dates with specific frequencies.
• **Method**:
  - `pd.date_range()`: Generates a range of dates with specified frequency.

**Example**:
```python
# Generating a date range with different frequencies
daily_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
weekly_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='W')
print(daily_rng)
print(weekly_rng)
```
**Output (Daily)**:
```
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')
```
**Output (Weekly)**:
```
DatetimeIndex(['2023-01-01', '2023-01-08'], dtype='datetime64[ns]', freq='W-SUN')
```
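
Instead of an end date, `pd.date_range()` also accepts a `periods` count, and the frequency can skip weekends with `'B'` (business day). A small sketch, with a start date chosen only because it falls on a Friday:

```python
import pandas as pd

# Five business days starting on Friday, 2023-01-06; the weekend is skipped
business_days = pd.date_range(start='2023-01-06', periods=5, freq='B')
print(business_days)
```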
---
---
---

### **Window Functions (Rolling, Expanding)**

* Window functions in Pandas, such as rolling and expanding, are powerful tools for performing calculations over a specified window of data.
* These functions allow you to analyze trends, calculate moving averages, and perform cumulative calculations.


### 1. Rolling Window Functions
• **Definition**: Rolling window functions allow you to perform calculations over a fixed-size window of data that moves along the time series or DataFrame. Common operations include calculating the mean, sum, or standard deviation over the window.
• **Method**:
  - `rolling()`: Creates a rolling window object.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Calculating a rolling mean with a window size of 3
rolling_mean = df['Values'].rolling(window=3).mean()
print(rolling_mean)
```
**Output**:
```
Date
2023-01-01     NaN
2023-01-02     NaN
2023-01-03    20.0
2023-01-04    30.0
2023-01-05    40.0
2023-01-06    50.0
2023-01-07    60.0
2023-01-08    70.0
2023-01-09    80.0
2023-01-10    90.0
Name: Values, dtype: float64
```

### 2. Rolling Window with Different Functions
• **Definition**: You can apply various aggregation functions to the rolling window, such as sum, min, max, and standard deviation.
• **Method**:
  - Call aggregation methods such as `sum()` and `std()` on the rolling object, or pass a list of function names to `agg()` to compute several at once.

**Example**:
```python
# Calculating rolling sum and standard deviation
rolling_sum = df['Values'].rolling(window=3).sum()
rolling_std = df['Values'].rolling(window=3).std()

# Combining results into a DataFrame
rolling_results = pd.DataFrame({
    'Rolling Mean': rolling_mean,
    'Rolling Sum': rolling_sum,
    'Rolling Std': rolling_std
})
print(rolling_results)
```
**Output**:
```
            Rolling Mean  Rolling Sum  Rolling Std
Date                                                  
2023-01-01            NaN          NaN           NaN
2023-01-02            NaN          NaN           NaN
2023-01-03           20.0         60.0      10.000000
2023-01-04           30.0         90.0      10.000000
2023-01-05           40.0        120.0      10.000000
2023-01-06           50.0        150.0      10.000000
2023-01-07           60.0        180.0      10.000000
2023-01-08           70.0        210.0      10.000000
2023-01-09           80.0        240.0      10.000000
2023-01-10           90.0        270.0      10.000000
```
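
Several rolling statistics can also be computed in a single call by passing a list of function names to `agg()`. The sketch below rebuilds a shorter version of the sample series so that it runs on its own; the names `values` and `summary` are illustrative.

```python
import pandas as pd

values = pd.Series(
    [10, 20, 30, 40, 50],
    index=pd.date_range(start='2023-01-01', periods=5, freq='D'),
    name='Values',
)

# One column per requested function; the first two rows are NaN
# because a full window of 3 observations is not yet available
summary = values.rolling(window=3).agg(['mean', 'sum', 'std'])
print(summary)
```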

### 3. Expanding Window Functions
• **Definition**: Expanding window functions allow you to perform calculations over all data points up to the current point. This is useful for cumulative calculations.
• **Method**:
  - `expanding()`: Creates an expanding window object.

**Example**:
```python
# Calculating expanding mean
expanding_mean = df['Values'].expanding().mean()
print(expanding_mean)
```
**Output**:
```
Date
2023-01-01     10.0
2023-01-02     15.0
2023-01-03     20.0
2023-01-04     25.0
2023-01-05     30.0
2023-01-06     35.0
2023-01-07     40.0
2023-01-08     45.0
2023-01-09     50.0
2023-01-10     55.0
Name: Values, dtype: float64
```

### 4. Expanding Window with Different Functions
• **Definition**: Similar to rolling windows, you can apply various aggregation functions to the expanding window.
• **Method**:
  - Call aggregation methods such as `sum()` and `std()` on the expanding object, or pass a list of function names to `agg()` to compute several at once.

**Example**:
```python
# Calculating expanding sum and standard deviation
expanding_sum = df['Values'].expanding().sum()
expanding_std = df['Values'].expanding().std()

# Combining results into a DataFrame
expanding_results = pd.DataFrame({
    'Expanding Mean': expanding_mean,
    'Expanding Sum': expanding_sum,
    'Expanding Std': expanding_std
})
print(expanding_results)
```
**Output**:
```
            Expanding Mean  Expanding Sum  Expanding Std
Date                                                      
2023-01-01            10.0           10.0            NaN
2023-01-02            15.0           30.0       7.071068
2023-01-03            20.0           60.0      10.000000
2023-01-04            25.0          100.0      12.909944
2023-01-05            30.0          150.0      15.811388
2023-01-06            35.0          210.0      18.708287
2023-01-07            40.0          280.0      21.602469
2023-01-08            45.0          360.0      24.494897
2023-01-09            50.0          450.0      27.386128
2023-01-10            55.0          550.0      30.276504
```

### 5. Customizing Window Functions
• **Definition**: You can customize the behavior of rolling and expanding windows by specifying parameters such as minimum periods and center alignment.
• **Method**:
  - Use parameters like `min_periods` and `center`.

**Example**:
```python
# Rolling mean with minimum periods
rolling_mean_min_periods = df['Values'].rolling(window=3, min_periods=1).mean()
print(rolling_mean_min_periods)

# Rolling mean with center alignment
rolling_mean_centered = df['Values'].rolling(window=3, center=True).mean()
print(rolling_mean_centered)
```
**Output (Min Periods)**:
```
Date
2023-01-01    10.0
2023-01-02    15.0
2023-01-03    20.0
2023-01-04    30.0
2023-01-05    40.0
2023-01-06    50.0
2023-01-07    60.0
2023-01-08    70.0
2023-01-09    80.0
2023-01-10    90.0
Name: Values, dtype: float64
```
**Output (Centered)**:
```
Date
2023-01-01     NaN
2023-01-02    20.0
2023-01-03    30.0
2023-01-04    40.0
2023-01-05    50.0
2023-01-06    60.0
2023-01-07    70.0
2023-01-08    80.0
2023-01-09    90.0
2023-01-10     NaN
Name: Values, dtype: float64
```

---
---
---

### **Handling Text Data**

* Handling text data in Pandas is essential for data cleaning, preprocessing, and analysis.
* Pandas provides a variety of functions and methods to manipulate and analyze text data efficiently.

### 1. Creating a DataFrame with Text Data
• **Definition**: You can create a DataFrame that contains text data, which can be manipulated and analyzed.
• **Method**:
  - Use a dictionary or a list of lists to create a DataFrame.

**Example**:
```python
import pandas as pd

# Sample DataFrame with text data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Occupation': ['Engineer', 'Doctor', 'Artist', 'Chef'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)
```
**Output**:
```
      Name Occupation         City
0    Alice   Engineer     New York
1      Bob     Doctor  Los Angeles
2  Charlie     Artist      Chicago
3    David       Chef      Houston
```

### 2. String Methods in Pandas
• **Definition**: Pandas provides a set of string methods that can be applied to Series containing text data.
• **Method**:
  - Use the `.str` accessor to access string methods.

**Example**:
```python
# Converting text to lowercase
df['Occupation'] = df['Occupation'].str.lower()
print(df)

# Checking if a string contains a substring
df['Is_Engineer'] = df['Occupation'].str.contains('engineer')
print(df)
```
**Output**:
```
      Name Occupation         City
0    Alice   engineer     New York
1      Bob     doctor  Los Angeles
2  Charlie     artist      Chicago
3    David       chef      Houston
```
```
      Name Occupation         City  Is_Engineer
0    Alice   engineer     New York         True
1      Bob     doctor  Los Angeles        False
2  Charlie     artist      Chicago        False
3    David       chef      Houston        False
```

### 3. String Manipulation
• **Definition**: You can perform various string manipulations, such as splitting, joining, replacing, and stripping whitespace.
• **Methods**:
  - `str.split()`: Splits strings into lists.
  - `str.join()`: Joins lists into strings.
  - `str.replace()`: Replaces substrings.
  - `str.strip()`: Removes leading and trailing whitespace.

**Example**:
```python
# Splitting a string into a list
df['City_Split'] = df['City'].str.split(' ')
print(df)

# Joining a list into a string
df['City_Joined'] = df['City_Split'].str.join(', ')
print(df)

# Replacing substrings
df['City'] = df['City'].str.replace('New York', 'NYC')
print(df)

# Stripping whitespace
df['City'] = df['City'].str.strip()
print(df)
```
**Output (Split)** (the `Is_Engineer` column is omitted for brevity):
```
      Name Occupation         City      City_Split
0    Alice   engineer     New York     [New, York]
1      Bob     doctor  Los Angeles  [Los, Angeles]
2  Charlie     artist      Chicago       [Chicago]
3    David       chef      Houston       [Houston]
```
**Output (Joined)**:
```
      Name Occupation         City      City_Split   City_Joined
0    Alice   engineer     New York     [New, York]     New, York
1      Bob     doctor  Los Angeles  [Los, Angeles]  Los, Angeles
2  Charlie     artist      Chicago       [Chicago]       Chicago
3    David       chef      Houston       [Houston]       Houston
```
**Output (Replaced)**: only the `City` column changes; the split and joined columns keep their earlier values, and `str.strip()` then leaves the data unchanged because there is no surrounding whitespace.
```
      Name Occupation         City      City_Split   City_Joined
0    Alice   engineer          NYC     [New, York]     New, York
1      Bob     doctor  Los Angeles  [Los, Angeles]  Los, Angeles
2  Charlie     artist      Chicago       [Chicago]       Chicago
3    David       chef      Houston       [Houston]       Houston
```

### 4. Text Data Analysis
• **Definition**: You can analyze text data to extract insights, such as counting occurrences of words or characters.
• **Method**:
  - Use string methods to perform analysis.

**Example**:
```python
# Counting the number of characters in each occupation
df['Occupation_Length'] = df['Occupation'].str.len()
print(df)

# Counting occurrences of a specific character
df['A_Count'] = df['Name'].str.count('a')
print(df)
```
**Output** (only the relevant columns are shown; `str.count('a')` is case-sensitive, so the capital "A" in "Alice" is not counted):
```
      Name Occupation         City  Occupation_Length
0    Alice   engineer          NYC                  8
1      Bob     doctor  Los Angeles                  6
2  Charlie     artist      Chicago                  6
3    David       chef      Houston                  4
```
```
      Name Occupation         City  Occupation_Length  A_Count
0    Alice   engineer          NYC                  8        0
1      Bob     doctor  Los Angeles                  6        0
2  Charlie     artist      Chicago                  6        1
3    David       chef      Houston                  4        1
```

### 5. Handling Missing Values in Text Data
• **Definition**: You can handle missing values in text data using methods like `fillna()` or `replace()`.
• **Method**:
  - Use `fillna()` to replace missing values.

**Example**:
```python
# Introducing missing values
df.loc[1, 'Occupation'] = None

# Filling missing values with a default value
df['Occupation'] = df['Occupation'].fillna('Unknown')
print(df)
```
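**Output** (only the relevant columns are shown; `Occupation_Length` was computed earlier and is not recalculated):
```
      Name Occupation         City  Occupation_Length  A_Count
0    Alice   engineer          NYC                  8        0
1      Bob    Unknown  Los Angeles                  6        0
2  Charlie     artist      Chicago                  6        1
3    David       chef      Houston                  4        1
```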

### 6. Regular Expressions with Text Data
• **Definition**: You can use regular expressions (regex) to perform complex string matching and manipulation.
• **Method**:
  - Use `str.contains()`, `str.match()`, and `str.replace()` with regex.

**Example**:
```python
# Checking if the occupation contains the letter 'a'
df['Contains_A'] = df['Occupation'].str.contains('a', case=False)
print(df)

# Replacing occupations that contain 'a' with 'Artist'
df['Occupation'] = df['Occupation'].str.replace(r'.*a.*', 'Artist', regex=True)
print(df)
```
**Output** (only the relevant columns are shown; of the four occupations, only "artist" contains the letter "a"):
```
      Name Occupation         City  Occupation_Length  A_Count  Contains_A
0    Alice   engineer          NYC                  8        0       False
1      Bob    Unknown  Los Angeles                  6        0       False
2  Charlie     artist      Chicago                  6        1        True
3    David       chef      Houston                  4        1       False
```
```
      Name Occupation         City  Occupation_Length  A_Count  Contains_A
0    Alice   engineer          NYC                  8        0       False
1      Bob    Unknown  Los Angeles                  6        0       False
2  Charlie     Artist      Chicago                  6        1        True
3    David       chef      Houston                  4        1       False
```
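
The difference between `str.contains()` and `str.match()` is easy to miss: the former searches anywhere in the string, while the latter only matches at the start. A minimal sketch with a made-up `jobs` Series:

```python
import pandas as pd

jobs = pd.Series(['artist', 'data analyst', 'chef'])

# str.contains() finds the pattern anywhere in each string
contains_a = jobs.str.contains(r'a')

# str.match() requires the pattern to match at the beginning of each string
starts_with_a = jobs.str.match(r'a')

print(pd.DataFrame({'job': jobs, 'contains_a': contains_a, 'starts_with_a': starts_with_a}))
```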

### 7. Concatenating and Joining Text Data
• **Definition**: You can concatenate or join text data from different columns or DataFrames.
• **Method**:
  - Use the `+` operator or `str.cat()` to concatenate strings element-wise.

**Example**:
```python
# Concatenating Name and Occupation
df['Name_Occupation'] = df['Name'] + ' is a ' + df['Occupation']
print(df)
```
**Output** (only the relevant columns are shown):
```
      Name Occupation         City  Occupation_Length  A_Count  Contains_A      Name_Occupation
0    Alice   engineer          NYC                  8        0       False  Alice is a engineer
1      Bob    Unknown  Los Angeles                  6        0       False     Bob is a Unknown
2  Charlie     Artist      Chicago                  6        1        True  Charlie is a Artist
3    David       chef      Houston                  4        1       False      David is a chef
```
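
`str.cat()` does the same job as `+` but adds a separator argument and a way to handle missing values. The sketch below uses a small made-up `people` DataFrame; with plain `+`, the row with a missing city would produce a missing result.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', None, 'Chicago'],
})

# sep is placed between the two values; na_rep stands in for missing entries
people['Label'] = people['Name'].str.cat(people['City'], sep=' - ', na_rep='Unknown')
print(people)
```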
---
---
---

### **Advanced Indexing and MultiIndex**

* Advanced indexing and MultiIndex in Pandas are powerful features that allow for more complex data manipulation and analysis.
* They enable you to work with hierarchical data structures and perform sophisticated data selection and aggregation.

### 1. Basic Indexing in Pandas
• **Definition**: Basic indexing allows you to select rows and columns from a DataFrame using labels or integer positions.
• **Methods**:
  - `.loc[]`: Label-based indexing.
  - `.iloc[]`: Integer position-based indexing.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

# Basic indexing
print(df.loc['row1'])  # Using label
print(df.iloc[0])      # Using integer position
```
**Output**:
```
A    1
B    4
C    7
Name: row1, dtype: int64
```
```
A    1
B    4
C    7
Name: row1, dtype: int64
```

### 2. Advanced Indexing with Boolean Arrays
• **Definition**: You can use boolean arrays to filter data based on conditions.
• **Method**:
  - Use boolean conditions to create a mask.

**Example**:
```python
# Filtering rows based on a condition
filtered_df = df[df['A'] > 1]
print(filtered_df)
```
**Output**:
```
       A  B  C
row2  2  5  8
row3  3  6  9
```
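
Conditions can be combined with `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons. A small self-contained sketch, with the names `frame`, `both`, and `either` chosen for illustration:

```python
import pandas as pd

frame = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=['row1', 'row2', 'row3'],
)

# Rows where both conditions hold
both = frame[(frame['A'] > 1) & (frame['B'] < 6)]

# Rows where at least one condition holds
either = frame[(frame['A'] == 1) | (frame['B'] == 6)]

print(both)
print(either)
```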

### 3. Setting and Resetting Index
• **Definition**: You can set a specific column as the index of a DataFrame or reset the index to the default integer index.
• **Methods**:
  - `set_index()`: Sets a column as the index.
  - `reset_index()`: Resets the index.

**Example**:
```python
# Setting a column as the index
df_set = df.set_index('A')
print(df_set)

# Resetting the index
df_reset = df_set.reset_index()
print(df_reset)
```
**Output (Set Index)**:
```
     B  C
A        
1  4  7
2  5  8
3  6  9
```
**Output (Reset Index)**:
```
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
```

### 4. MultiIndex
• **Definition**: MultiIndex allows you to create a hierarchical index for a DataFrame, enabling more complex data structures and easier data manipulation.
• **Method**:
  - Use `pd.MultiIndex.from_arrays()`, `pd.MultiIndex.from_tuples()`, or `pd.MultiIndex.from_product()` to create a MultiIndex.

**Example**:
```python
# Creating a MultiIndex
arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('letter', 'number'))

# Creating a DataFrame with MultiIndex
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)
print(df_multi)
```
**Output**:
```
              value
letter number       
A      one        1
       two        2
B      one        3
       two        4
```
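
The same index can be built from a list of tuples or from the Cartesian product of the level values. The sketch below constructs it both ways and checks that the results are equal; the variable names are illustrative.

```python
import pandas as pd

# One tuple per row
from_tuples = pd.MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'two'), ('B', 'one'), ('B', 'two')],
    names=['letter', 'number'],
)

# Every combination of the two level lists, in order
from_product = pd.MultiIndex.from_product(
    [['A', 'B'], ['one', 'two']],
    names=['letter', 'number'],
)

print(from_tuples.equals(from_product))  # True
```

`from_product()` is the shorter option when every combination of levels is present; `from_tuples()` suits irregular combinations.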

### 5. Accessing Data with MultiIndex
• **Definition**: You can access data in a MultiIndex DataFrame using tuples or the `.loc[]` method.
• **Method**:
  - Use tuples to specify the index levels.

**Example**:
```python
# Accessing data using MultiIndex
print(df_multi.loc['A'])          # Access all rows for 'A'
print(df_multi.loc[('A', 'one')]) # Access specific row
```
**Output**:
```
              value
number             
one             1
two             2
```
```
value    1
Name: (A, one), dtype: int64
```

### 6. Slicing with MultiIndex
• **Definition**: You can slice MultiIndex DataFrames to access a range of data.
• **Method**:
  - Use `.loc[]` with slices.

**Example**:
```python
# Slicing MultiIndex DataFrame
print(df_multi.loc['A':'B'])  # Slicing by the first level
```
**Output**:
```
              value
letter number       
A      one        1
       two        2
B      one        3
       two        4
```
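
Slicing on the first level is the simplest case. To select by the inner level instead, two common options are `xs()` and `pd.IndexSlice`. This is a sketch on a rebuilt copy of the sample data (named `scores` here so it runs on its own); note that `IndexSlice` selection with `.loc` expects a sorted index, which holds for this data.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B'], ['one', 'two']],
    names=['letter', 'number'],
)
scores = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# xs() selects a cross-section and drops the selected level from the result
inner_one = scores.xs('one', level='number')
print(inner_one)

# IndexSlice keeps both levels: every letter, but only number == 'one'
idx = pd.IndexSlice
inner_one_kept = scores.loc[idx[:, 'one'], :]
print(inner_one_kept)
```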

### 7. Swapping and Stacking/Unstacking MultiIndex
• **Definition**: You can swap levels of a MultiIndex or stack/unstack the DataFrame to change its shape.
• **Methods**:
  - `swaplevel()`: Swaps levels in a MultiIndex.
  - `stack()`: Stacks the columns into the index.
  - `unstack()`: Unstacks the index into columns.

**Example**:
```python
# Swapping levels
df_swapped = df_multi.swaplevel()
print(df_swapped)

# Unstacking: the inner index level ('number') becomes columns
df_unstacked = df_multi.unstack()
print(df_unstacked)
```
**Output (Swapped Levels)**:
```
              value
number letter       
one    A          1
two    A          2
one    B          3
two    B          4
```
**Output (Unstacked)**:
```
        value       
number      one two
letter             
A            1   2
B            3   4
```
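
`stack()` is the inverse operation: it moves column labels back into the index. The sketch below makes the round trip on a rebuilt copy of the sample data (the names `long_df`, `wide`, and `restacked` are illustrative).

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B'], ['one', 'two']],
    names=['letter', 'number'],
)
long_df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# unstack(): the inner index level ('number') becomes the columns
wide = long_df['value'].unstack()
print(wide)

# stack(): the columns move back into the index, giving a Series again
restacked = wide.stack()
print(restacked)
```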

### 8. Resetting MultiIndex
• **Definition**: You can reset a MultiIndex to convert it back to regular columns.
• **Method**:
  - `reset_index()`: Resets the index of a MultiIndex DataFrame.

**Example**:
```python
# Resetting MultiIndex
df_reset_multi = df_multi.reset_index()
print(df_reset_multi)
```
**Output**:
```
  letter number  value
0      A    one      1
1      A    two      2
2      B    one      3
3      B    two      4
```

---
---
---

### **Data Visualization**

* Data visualization is a crucial aspect of data analysis, allowing you to communicate insights and patterns effectively.
* Pandas plots through Matplotlib via the `.plot()` method, and Seaborn accepts DataFrames directly, so a variety of plots can be created straight from your data.

### 1. Basic Plotting with Pandas
• **Definition**: Pandas integrates with Matplotlib to provide simple plotting capabilities directly from DataFrames and Series.
• **Method**:
  - Use the `.plot()` method on DataFrames and Series.

**Example**:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
data = {
    'Year': [2018, 2019, 2020, 2021, 2022],
    'Sales': [150, 200, 250, 300, 350]
}
df = pd.DataFrame(data)

# Basic line plot
df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid()
plt.show()
```
**Output**: A line plot showing sales over the years.

### 2. Different Plot Types
• **Definition**: Pandas supports various plot types, including line, bar, histogram, box, and scatter plots.
• **Method**:
  - Specify the `kind` parameter in the `.plot()` method.

**Example**:
```python
# Bar plot
df.plot(x='Year', y='Sales', kind='bar', title='Sales by Year')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

# Histogram
df['Sales'].plot(kind='hist', bins=5, title='Sales Distribution')
plt.xlabel('Sales')
plt.show()

# Box plot (select the Sales column so that Year is not drawn as a second box)
df['Sales'].plot(kind='box', title='Sales Box Plot')
plt.show()
```
**Output**:
• A bar plot showing sales by year.
• A histogram showing the distribution of sales.
• A box plot showing the summary statistics of sales.

### 3. Customizing Plots
• **Definition**: You can customize plots by modifying titles, labels, colors, and styles.
• **Method**:
  - Use parameters in the `.plot()` method and Matplotlib functions.

**Example**:
```python
# Customizing a line plot
df.plot(x='Year', y='Sales', kind='line', color='orange', marker='o', linestyle='--', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid()
plt.show()
```
**Output**: A customized line plot with specific colors and styles.

### 4. Subplots
• **Definition**: You can create multiple plots in a single figure using subplots.
• **Method**:
  - Use the `plt.subplots()` function from Matplotlib and pass each axis to `.plot()` through the `ax` parameter.

**Example**:
```python
# Creating subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 8))

# Line plot
df.plot(x='Year', y='Sales', kind='line', ax=axs[0], title='Sales Over Years')
axs[0].set_ylabel('Sales')

# Bar plot
df.plot(x='Year', y='Sales', kind='bar', ax=axs[1], title='Sales by Year')
axs[1].set_ylabel('Sales')

plt.tight_layout()
plt.show()
```
**Output**: A figure with two subplots: a line plot and a bar plot.

### 5. Scatter Plots
• **Definition**: Scatter plots are useful for visualizing the relationship between two numerical variables.
• **Method**:
  - Use the `.plot.scatter()` method.

**Example**:
```python
# Sample DataFrame with additional data
data = {
    'Year': [2018, 2019, 2020, 2021, 2022],
    'Sales': [150, 200, 250, 300, 350],
    'Profit': [30, 50, 70, 90, 110]
}
df = pd.DataFrame(data)

# Scatter plot
df.plot.scatter(x='Sales', y='Profit', title='Sales vs Profit', color='green')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()
```
**Output**: A scatter plot showing the relationship between sales and profit.

### 6. Time Series Plotting
• **Definition**: Time series data can be visualized using line plots to show trends over time.
• **Method**:
  - Use the `.plot()` method on a DataFrame with a datetime index.

**Example**:
```python
# Sample time series data
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df_time = pd.DataFrame(date_rng, columns=['date'])
df_time['data'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df_time.set_index('date', inplace=True)

# Time series plot
df_time.plot(title='Time Series Data', figsize=(10, 5))
plt.xlabel('Date')
plt.ylabel('Data')
plt.grid()
plt.show()
```
**Output**: A line plot showing the time series data.

### 7. Using Seaborn for Enhanced Visualizations
• **Definition**: Seaborn is a statistical data visualization library built on top of Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
• **Method**:
  - Use Seaborn functions to create various plots.

**Example**:
```python
import seaborn as sns

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
}
df_seaborn = pd.DataFrame(data)

# Box plot using Seaborn
sns.boxplot(x='Category', y='Values', data=df_seaborn)
plt.title('Box Plot of Values by Category')
plt.show()

# Bar plot using Seaborn
sns.barplot(x='Category', y='Values', data=df_seaborn)
plt.title('Bar Plot of Values by Category')
plt.show()
```
**Output**:
• A box plot showing the distribution of values by category.
• A bar plot showing the average values by category.

### 8. Saving Plots
• **Definition**: You can save your plots to files in various formats (e.g., PNG, PDF).
• **Method**:
  - Use the `savefig()` function from Matplotlib.

**Example**:
```python
# Saving a plot
# DataFrame.plot() creates its own figure, so set the size here rather than with plt.figure()
df_time.plot(title='Time Series Data', figsize=(10, 5))
plt.xlabel('Date')
plt.ylabel('Data')
plt.grid()
plt.savefig('time_series_plot.png')  # Save as PNG
plt.show()
```
**Output**: The plot is saved as a PNG file named `time_series_plot.png`.

---
---
---

### **Performance Optimization (Vectorization, Memory Usage)**

Performance optimization in Pandas matters when handling large datasets. Two key aspects are vectorization and memory usage management. The topics below cover both, with definitions and examples.

### 1. Vectorization
• **Definition**: Vectorization refers to the process of applying operations to entire arrays or Series instead of using loops. This takes advantage of low-level optimizations and is generally much faster than iterating through elements one by one.
• **Method**:
  - Use built-in Pandas functions and operations that operate on entire Series or DataFrames.

**Example**:
```python
import pandas as pd
import numpy as np
import time

# Sample DataFrame
n = 10**6  # 1 million rows
df = pd.DataFrame({
    'A': np.random.rand(n),
    'B': np.random.rand(n)
})

# Vectorized operation
start_time = time.time()
df['C'] = df['A'] + df['B']  # Vectorized addition
end_time = time.time()
print(f"Vectorized operation time: {end_time - start_time:.5f} seconds")
```
**Output**: The time taken for the vectorized operation will be printed.

### 2. Avoiding Loops
• **Definition**: Avoid using Python loops (e.g., `for` loops) for operations on DataFrames, as they are significantly slower than vectorized operations.
• **Method**:
  - Use Pandas built-in functions and methods instead of iterating through rows.

**Example**:
```python
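# Non-vectorized operation using a loop (slow on 1 million rows)
start_time = time.time()
df['D'] = 0.0  # Float column, so the assigned values keep their dtype
for i in range(len(df)):
    df.at[i, 'D'] = df.at[i, 'A'] * 2  # .at avoids chained assignment such as df['D'][i] = ...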
end_time = time.time()
print(f"Non-vectorized operation time: {end_time - start_time:.5f} seconds")
```
**Output**: The time taken for the non-vectorized operation will be printed, and it will be significantly longer than the vectorized operation.
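
A quick way to gain confidence before replacing a loop is to check on a small sample that the vectorized expression gives the same result. The sketch below does that on a five-row DataFrame; `small`, `loop`, and `vectorized` are illustrative names.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'A': np.arange(5, dtype='float64')})

# Loop version: one scalar read and one scalar write per row
small['loop'] = 0.0
for i in range(len(small)):
    small.at[i, 'loop'] = small.at[i, 'A'] * 2

# Vectorized version: a single operation on the whole column
small['vectorized'] = small['A'] * 2

print(small['loop'].equals(small['vectorized']))  # True
```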

### 3. Using NumPy for Performance
• **Definition**: NumPy is a powerful library for numerical computations in Python. Using NumPy functions can enhance performance when working with numerical data.
• **Method**:
  - Convert Pandas DataFrames or Series to NumPy arrays for faster computations.

**Example**:
```python
# Using NumPy for calculations
start_time = time.time()
array_A = df['A'].to_numpy()
array_B = df['B'].to_numpy()
df['E'] = array_A * array_B  # Using NumPy for element-wise multiplication
end_time = time.time()
print(f"NumPy operation time: {end_time - start_time:.5f} seconds")
```
**Output**: The time taken for the NumPy operation will be printed.

### 4. Memory Usage Optimization
• **Definition**: Managing memory usage is essential when working with large datasets. Optimizing data types can significantly reduce memory consumption.
• **Method**:
  - Use appropriate data types for columns (e.g., `category`, `float32`, `int8`).

**Example**:
```python
# Checking memory usage (info() prints its report directly and returns None)
df.info(memory_usage='deep')

# Optimizing data types (float32 halves the memory of float64 at the cost of precision)
df['A'] = df['A'].astype('float32')  # Change to float32
df['B'] = df['B'].astype('float32')  # Change to float32
df['C'] = df['C'].astype('float32')  # Change to float32

# Checking memory usage after optimization
df.info(memory_usage='deep')
```
**Output**: The memory usage before and after optimization will be printed, showing a reduction in memory consumption.
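
Choosing a dtype by hand for every column gets tedious on wide frames. As a rough sketch, `pd.to_numeric` with its `downcast` argument can pick the smallest dtype that still holds the values; the helper name `downcast_numeric` and the sample columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def downcast_numeric(frame):
    """Return a copy with numeric columns downcast to smaller dtypes where possible."""
    result = frame.copy()
    for col in result.select_dtypes(include=np.integer).columns:
        result[col] = pd.to_numeric(result[col], downcast='integer')
    for col in result.select_dtypes(include=np.floating).columns:
        result[col] = pd.to_numeric(result[col], downcast='float')
    return result

sample = pd.DataFrame({
    'count': np.arange(1000, dtype='int64'),  # values fit in a smaller integer type
    'ratio': np.linspace(0, 1, 1000),         # float64 by default
})
compact = downcast_numeric(sample)

before = sample.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(sample.dtypes.to_dict(), before)
print(compact.dtypes.to_dict(), after)
```

Float downcasting trades precision for space, so it is worth checking that the rounded values are still acceptable for the analysis.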

### 5. Using `query()` for Filtering
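• **Definition**: The `query()` method filters a DataFrame using an expression string. On large DataFrames it can be faster than boolean indexing when the optional `numexpr` package is installed; on small DataFrames, plain boolean indexing is usually at least as fast.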
• **Method**:
  - Use `query()` to filter rows based on conditions.

**Example**:
```python
# Filtering using query
start_time = time.time()
filtered_df = df.query('A > 0.5 and B < 0.5')
end_time = time.time()
print(f"Query operation time: {end_time - start_time:.5f} seconds")
```
**Output**: The time taken for the query operation will be printed.
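
A query string and a boolean mask select the same rows, so the choice is mostly about readability and speed. The sketch below, using an illustrative frame and a hypothetical `threshold` variable, also shows how `@` refers to a local Python variable inside the expression.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({'A': rng.random(1000), 'B': rng.random(1000)})

threshold = 0.5  # local Python variable, referenced in the query with @

by_query = frame.query('A > @threshold and B < @threshold')
by_mask = frame[(frame['A'] > threshold) & (frame['B'] < threshold)]

# Both approaches return the same rows
print(by_query.equals(by_mask))
```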

### 6. Using `apply()` Efficiently
• **Definition**: The `apply()` method can be used for applying functions along the axis of a DataFrame. However, it can be slower than vectorized operations.
• **Method**:
  - Use `apply()` only when necessary and prefer vectorized functions when possible.

**Example**:
```python
# Using apply (less efficient)
start_time = time.time()
df['F'] = df['A'].apply(lambda x: x * 2)  # Using apply
end_time = time.time()
print(f"Apply operation time: {end_time - start_time:.5f} seconds")
```
**Output**: The time taken for the apply operation will be printed, and it will be longer than the vectorized operations.
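
Conditional logic is a common reason for reaching for `apply()`, yet it often has a vectorized equivalent. A minimal sketch, assuming a hypothetical `score` column and a pass mark of 50, using `np.where`:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'score': [35, 72, 50, 91]})

# Element-by-element version with apply
scores['label_apply'] = scores['score'].apply(lambda s: 'pass' if s >= 50 else 'fail')

# Vectorized equivalent: np.where evaluates the condition on the whole column at once
scores['label_where'] = np.where(scores['score'] >= 50, 'pass', 'fail')

print(scores)
```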

### 7. Profiling and Benchmarking
• **Definition**: Profiling helps identify bottlenecks in your code, allowing you to optimize performance effectively.
• **Method**:
  - Use libraries like `line_profiler` or `memory_profiler` to analyze performance.

**Example**:
```python
# Example of profiling (requires installation of line_profiler)
# Use the following command in the terminal to install:
# pip install line_profiler

# @profile
def example_function():
    df['G'] = df['A'] + df['B']  # Example operation

# Call the function to profile
example_function()
```
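**Output**: With `# @profile` uncommented and the script run through `kernprof -l -v script.py` (where `script.py` is your own file), `line_profiler` reports the time spent on each line of the function. Run as plain Python, the code above prints nothing.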
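
When installing a profiler is not an option, the standard library's `timeit` module gives a quick benchmark of competing implementations. This is a sketch under assumptions: the frame size, the `candidates` dictionary, and the three implementations are chosen only for illustration.

```python
import timeit

import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': np.random.rand(5_000), 'B': np.random.rand(5_000)})

# Candidate implementations of the same calculation (names are illustrative)
candidates = {
    'vectorized': lambda: frame['A'] + frame['B'],
    'numpy': lambda: frame['A'].to_numpy() + frame['B'].to_numpy(),
    'apply': lambda: frame.apply(lambda row: row['A'] + row['B'], axis=1),
}

# Best of three single runs per candidate
timings = {
    name: min(timeit.repeat(func, number=1, repeat=3))
    for name, func in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:<12}{seconds:.5f} s")
```

Taking the minimum of several repeats reduces the influence of background noise on the measurement.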

---
---
---

### **Exporting and Importing Data**

* Exporting and importing data in Pandas is essential for data analysis workflows, allowing you to read data from various file formats and save processed data back to files.
* Pandas provides built-in functions to handle a wide range of data formats, including CSV, Excel, JSON, SQL databases, and more.

### 1. Importing Data

#### 1.1 Importing CSV Files
• **Definition**: CSV (Comma-Separated Values) is a common file format for storing tabular data.
• **Method**:
  - Use `pd.read_csv()` to read CSV files into a DataFrame.

**Example**:
```python
import pandas as pd

# Importing a CSV file
df_csv = pd.read_csv('data.csv')
print(df_csv.head())
```

#### 1.2 Importing Excel Files
• **Definition**: Excel files are widely used for data storage and analysis.
• **Method**:
  - Use `pd.read_excel()` to read Excel files into a DataFrame.

**Example**:
```python
# Importing an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel.head())
```

#### 1.3 Importing JSON Files
• **Definition**: JSON (JavaScript Object Notation) is a lightweight data interchange format.
• **Method**:
  - Use `pd.read_json()` to read JSON files into a DataFrame.

**Example**:
```python
# Importing a JSON file
df_json = pd.read_json('data.json')
print(df_json.head())
```

#### 1.4 Importing from SQL Databases
• **Definition**: You can read data from SQL databases using SQL queries.
• **Method**:
  - Use `pd.read_sql()` to read data from a SQL database.

**Example**:
```python
import sqlite3

# Connecting to a SQLite database
conn = sqlite3.connect('database.db')

# Importing data from a SQL table
df_sql = pd.read_sql('SELECT * FROM table_name', conn)
print(df_sql.head())

# Closing the connection
conn.close()
```

### 2. Exporting Data

#### 2.1 Exporting to CSV Files
• **Definition**: You can save a DataFrame to a CSV file.
• **Method**:
  - Use `df.to_csv()` to export DataFrames to CSV files.

**Example**:
```python
# Exporting DataFrame to a CSV file
df_csv.to_csv('output.csv', index=False)  # index=False to avoid writing row indices
```

#### 2.2 Exporting to Excel Files
• **Definition**: You can save a DataFrame to an Excel file.
• **Method**:
  - Use `df.to_excel()` to export DataFrames to Excel files.

**Example**:
```python
# Exporting DataFrame to an Excel file
df_excel.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
```

#### 2.3 Exporting to JSON Files
• **Definition**: You can save a DataFrame to a JSON file.
• **Method**:
  - Use `df.to_json()` to export DataFrames to JSON files.

**Example**:
```python
# Exporting DataFrame to a JSON file
df_json.to_json('output.json', orient='records', lines=True)
```

#### 2.4 Exporting to SQL Databases
• **Definition**: You can save a DataFrame to a SQL database.
• **Method**:
  - Use `df.to_sql()` to write DataFrames to a SQL table.

**Example**:
```python
# Connecting to a SQLite database
conn = sqlite3.connect('database.db')

# Exporting DataFrame to a SQL table
df_sql.to_sql('table_name', conn, if_exists='replace', index=False)

# Closing the connection
conn.close()
```
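
The SQL import and export steps can be tried end to end without creating a file. The sketch below assumes an in-memory SQLite database and a hypothetical `people` table; it writes a DataFrame and reads part of it back with a parameterized query.

```python
import sqlite3
from contextlib import closing

import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [24, 30]})

# ':memory:' creates a temporary database; closing() closes the connection on exit
with closing(sqlite3.connect(':memory:')) as conn:
    people.to_sql('people', conn, if_exists='replace', index=False)
    # The ? placeholder is filled from params, which avoids building SQL by string formatting
    adults = pd.read_sql(
        'SELECT name, age FROM people WHERE age > ?', conn, params=(25,)
    )

print(adults)
```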

### 3. Handling File Paths
• **Definition**: When importing or exporting files, you can specify relative or absolute file paths.
• **Method**:
  - Use appropriate file paths based on your working directory.

**Example**:
```python
# Importing from a relative path
df_relative = pd.read_csv('./data/data.csv')

# Exporting to an absolute path
df_csv.to_csv('/Users/username/Documents/output.csv', index=False)
```

### 4. Additional Options
• **Definition**: When importing or exporting data, you can specify additional options to customize the process.
• **Method**:
  - Use parameters like `sep`, `header`, `encoding`, and `na_values` in `read_csv()`, and `index`, `header`, and `sheet_name` in `to_excel()`.

**Example**:
```python
# Importing a CSV file with custom separator and encoding
df_custom = pd.read_csv('data.csv', sep=';', encoding='utf-8')

# Exporting to CSV with custom options
df_csv.to_csv('output.csv', index=False, header=True, na_rep='NA')
```
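
These options can be tested without any file on disk by reading from an in-memory buffer. A minimal sketch, assuming semicolon-separated text in which `n/a` marks a missing value (the sample data is made up for illustration):

```python
from io import StringIO

import pandas as pd

# Semicolon-separated text where 'n/a' marks a missing value
raw = "name;score\nAlice;90\nBob;n/a\n"

scores = pd.read_csv(StringIO(raw), sep=';', na_values=['n/a'])
print(scores)

# With no path argument, to_csv returns the CSV text instead of writing a file
exported = scores.to_csv(index=False, na_rep='NA')
print(exported)
```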

---
---
---

### **Handling Duplicates**

* Handling duplicates in Pandas is an essential part of data cleaning and preprocessing.
* Duplicate entries can lead to inaccurate analysis and insights, so it's important to identify and manage them effectively.

### 1. Identifying Duplicates
• **Definition**: You can identify duplicate rows in a DataFrame using the `duplicated()` method, which returns a boolean Series indicating whether each row is a duplicate.
• **Method**:
  - `DataFrame.duplicated()`: Checks for duplicate rows.

**Example**:
```python
import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
    'Age': [24, 30, 22, 24, 35, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston', 'Los Angeles']
}
df = pd.DataFrame(data)

# Identifying duplicates
duplicates = df.duplicated()
print(duplicates)
```
**Output**:
```
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
```

### 2. Counting Duplicates
• **Definition**: You can count the number of duplicate rows in a DataFrame using the `sum()` function on the boolean Series returned by `duplicated()`.
• **Method**:
  - Use `sum()` to count duplicates.

**Example**:
```python
# Counting duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
```
**Output**:
```
Number of duplicate rows: 2
```
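
A single total does not show which records repeat. As a sketch, grouping on the columns of interest and counting the group sizes lists each repeated combination; the frame `people` and the column name `occurrences` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
    'Age': [24, 30, 22, 24, 35, 30],
})

# Count how often each (Name, Age) combination occurs
counts = people.groupby(['Name', 'Age']).size().reset_index(name='occurrences')

# Keep only the combinations that appear more than once
repeated = counts[counts['occurrences'] > 1]
print(repeated)
```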

### 3. Removing Duplicates
• **Definition**: You can remove duplicate rows from a DataFrame using the `drop_duplicates()` method.
• **Method**:
  - `DataFrame.drop_duplicates()`: Removes duplicate rows.

**Example**:
```python
# Removing duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
```
**Output**:
```
      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
4    David   35      Houston
```

### 4. Removing Duplicates Based on Specific Columns
• **Definition**: You can specify certain columns to consider when identifying duplicates.
• **Method**:
  - Use the `subset` parameter in `drop_duplicates()`.

**Example**:
```python
# Removing duplicates based on the 'Name' column
df_no_duplicates_name = df.drop_duplicates(subset='Name')
print(df_no_duplicates_name)
```
**Output**:
```
      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
4    David   35      Houston
```

### 5. Keeping Specific Duplicates
• **Definition**: You can choose to keep the first or last occurrence of duplicates when removing them.
• **Method**:
  - Use the `keep` parameter in `drop_duplicates()`.

**Example**:
```python
# Keeping the last occurrence of duplicates
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)
```
**Output**:
```
      Name  Age         City
2  Charlie   22      Chicago
3    Alice   24     New York
4    David   35      Houston
5      Bob   30  Los Angeles
```
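
Besides `'first'` and `'last'`, the `keep` parameter accepts `False`, which drops every row that has a duplicate. A small sketch with a hypothetical `readings` frame:

```python
import pandas as pd

readings = pd.DataFrame({
    'sensor': ['a', 'b', 'a', 'c'],
    'value': [1, 2, 3, 4],
})

# keep=False removes every row whose 'sensor' value appears more than once
unique_only = readings.drop_duplicates(subset='sensor', keep=False)
print(unique_only)
```

Only sensors `b` and `c` remain, because both rows for sensor `a` are treated as duplicates of each other.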

### 6. Resetting Index After Removing Duplicates
• **Definition**: After removing duplicates, the index may not be sequential. You can reset the index using `reset_index()`.
• **Method**:
  - Use `reset_index(drop=True)` to reset the index.

**Example**:
```python
# Resetting index after removing duplicates
df_reset_index = df_no_duplicates.reset_index(drop=True)
print(df_reset_index)
```
**Output**:
```
      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
3    David   35      Houston
```

### 7. Finding Duplicate Rows with Specific Conditions
• **Definition**: You can find duplicates based on specific conditions using boolean indexing.
• **Method**:
  - Combine `duplicated()` with boolean indexing.

**Example**:
```python
# Finding duplicates with specific conditions
duplicates_condition = df[df.duplicated(subset=['Name', 'City'], keep=False)]
print(duplicates_condition)
```
**Output**:
```
    Name  Age         City
0  Alice   24     New York
1    Bob   30  Los Angeles
3  Alice   24     New York
5    Bob   30  Los Angeles
```

---
---
---

### **Applying Functions (apply, map)**

* In Pandas, applying functions to DataFrames and Series is a common operation that allows for flexible data manipulation and transformation. The `apply()` and `map()` methods are two powerful tools for applying functions to data.

### 1. Using `apply()`
• **Definition**: The `apply()` method allows you to apply a function along an axis of the DataFrame (rows or columns) or to a Series. It can be used for both element-wise operations and aggregations.
• **Method**:
  - `DataFrame.apply(func, axis=0)`: Applies a function along the specified axis (0 for columns, 1 for rows).

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)

# Applying a function to each column (default axis=0)
column_sum = df.apply(sum)
print(column_sum)

# Applying a function to each row (axis=1)
row_sum = df.apply(sum, axis=1)
print(row_sum)
```
**Output**:
```
A     6
B    15
C    24
dtype: int64
```
```
0    12
1    15
2    18
dtype: int64
```

### 2. Applying Custom Functions
• **Definition**: You can define and apply custom functions using `apply()`.
• **Method**:
  - Pass a custom function to `apply()`.

**Example**:
```python
# Custom function to calculate the range
def calculate_range(x):
    return x.max() - x.min()

# Applying the custom function to each column
range_result = df.apply(calculate_range)
print(range_result)
```
**Output**:
```
A    2
B    2
C    2
dtype: int64
```

### 3. Using `map()`
• **Definition**: The `map()` method is used primarily with Series to apply a function or mapping correspondence to each element. It is particularly useful for element-wise transformations.
• **Method**:
  - `Series.map(func)`: Applies a function to each element in the Series.

**Example**:
```python
# Sample Series
s = pd.Series([1, 2, 3, 4])

# Using map to apply a function
squared = s.map(lambda x: x ** 2)
print(squared)
```
**Output**:
```
0     1
1     4
2     9
3    16
dtype: int64
```

### 4. Mapping with Dictionaries
• **Definition**: You can use `map()` with a dictionary to replace values in a Series based on a mapping.
• **Method**:
  - Pass a dictionary to `map()`.

**Example**:
```python
# Sample Series with categorical data
s_categories = pd.Series(['cat', 'dog', 'cat', 'bird'])

# Mapping categories to numerical values
category_mapping = {'cat': 1, 'dog': 2, 'bird': 3}
mapped_values = s_categories.map(category_mapping)
print(mapped_values)
```
**Output**:
```
0    1
1    2
2    1
3    3
dtype: int64
```

### 5. Differences Between `apply()` and `map()`
• **Definition**: While both `apply()` and `map()` are used to apply functions, they have different use cases:
  - `apply()`: Can be used on both DataFrames and Series, and can apply functions along rows or columns.
  - `map()`: Primarily used with Series for element-wise operations and transformations.
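
The difference is easiest to see from what each function receives as input. In this sketch (the frame and variable names are illustrative), `map()` is handed single values, while `apply()` on a DataFrame is handed a whole column or row as a Series.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Series.map: the function receives one value at a time
doubled = frame['A'].map(lambda x: x * 2)

# DataFrame.apply with axis=0 (default): the function receives each column as a Series
column_max = frame.apply(lambda col: col.max())

# DataFrame.apply with axis=1: the function receives each row as a Series
row_total = frame.apply(lambda row: row['A'] + row['B'], axis=1)

print(doubled.tolist(), column_max.to_dict(), row_total.tolist())
```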

### 6. Performance Considerations
• **Definition**: Vectorized operations (using built-in functions) are generally faster than using `apply()` or `map()`. When possible, prefer using vectorized operations for better performance.
• **Method**:
  - Use built-in Pandas functions for operations instead of `apply()` or `map()` when applicable.

**Example**:
```python
# Vectorized operation for squaring
vectorized_squared = s ** 2
print(vectorized_squared)
```
**Output**:
```
0     1
1     4
2     9
3    16
dtype: int64
```
---
---
---

### **Categorical Data**

* Categorical data in Pandas is a data type that represents a limited, fixed number of possible values (categories). This type of data is useful for representing qualitative data, such as labels, categories, or groups.
* Categorical data can lead to more efficient memory usage and faster computations compared to using regular object types.

### 1. Creating Categorical Data
• **Definition**: You can create categorical data using the `Categorical` data type in Pandas.
• **Method**:
  - Use `pd.Categorical()` or specify the `dtype='category'` when creating a DataFrame.

**Example**:
```python
import pandas as pd

# Creating a DataFrame with categorical data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Gender': ['Female', 'Male', 'Male', 'Male'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)

# Converting 'Gender' column to categorical
df['Gender'] = pd.Categorical(df['Gender'])
print(df['Gender'].dtype)
```
**Output**:
```
category
```

### 2. Specifying Categories
• **Definition**: You can specify the categories explicitly when creating a categorical variable.
• **Method**:
  - Use the `categories` parameter in `pd.Categorical()`.

**Example**:
```python
# Specifying categories
df['Gender'] = pd.Categorical(df['Gender'], categories=['Male', 'Female'], ordered=True)
print(df['Gender'])
```
**Output**:
```
0    Female
1      Male
2      Male
3      Male
Name: Gender, dtype: category
Categories (2, object): ['Male' < 'Female']
```
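
Setting `ordered=True` matters because comparisons and sorting then follow the declared category order rather than alphabetical order. A minimal sketch with a hypothetical `sizes` Series:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# Comparisons use the declared order: small < medium < large
larger_than_small = sizes > 'small'
print(larger_than_small.tolist())

# min() and max() are only defined for ordered categoricals
print(sizes.max())
```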

### 3. Benefits of Categorical Data
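• **Definition**: A categorical column stores each distinct value once and represents the rows as small integer codes, so it saves memory and speeds up operations such as grouping when values repeat often. On a tiny frame with mostly unique values, like the one below, the saving is negligible.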
• **Method**:
  - Use `df.info()` to compare memory usage.

**Example**:
```python
# Original DataFrame (info() prints its report directly and returns None)
df.info(memory_usage='deep')

# Converting 'City' column to categorical
df['City'] = pd.Categorical(df['City'])
df.info(memory_usage='deep')
```
**Output**:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object
 1   Gender  4 non-null      category
 2   City    4 non-null      object
dtypes: category(1), object(2)
memory usage: 1.1 KB
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Name    4 non-null      object  
 1   Gender  4 non-null      category
 2   City    4 non-null      category
dtypes: category(2), object(1)
memory usage: 1.0 KB
```
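
The memory benefit becomes visible once a column holds many repeated values. The sketch below assumes a Series of 100,000 city names drawn from four distinct values; the byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Four distinct values repeated 25,000 times each
cities = pd.Series(['New York', 'Los Angeles', 'Chicago', 'Houston'] * 25_000)

object_bytes = cities.memory_usage(deep=True)
category_bytes = cities.astype('category').memory_usage(deep=True)

print(f"As strings:  {object_bytes:,} bytes")
print(f"As category: {category_bytes:,} bytes")
```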

### 4. Operations on Categorical Data
• **Definition**: You can perform various operations on categorical data, such as sorting, filtering, and grouping.
• **Method**:
  - Use standard DataFrame operations.

**Example**:
```python
# Sorting by categorical data
df_sorted = df.sort_values(by='Gender')
print(df_sorted)

# Grouping by categorical data
grouped = df.groupby('Gender', observed=True).size()  # observed=True counts only categories present in the data
print(grouped)
```
**Output (Sorted)**:
```
      Name  Gender         City
1      Bob    Male  Los Angeles
2  Charlie    Male      Chicago
3    David    Male      Houston
0    Alice  Female     New York
```
```
Gender
Male      3
Female    1
dtype: int64
```

### 5. Encoding Categorical Data
• **Definition**: Categorical data can be encoded into numerical values for machine learning models.
• **Method**:
  - Use `pd.get_dummies()` for one-hot encoding or `cat.codes` for label encoding.

**Example**:
```python
# One-hot encoding (dtype=int gives 0/1 columns; recent pandas versions default to True/False)
df_encoded = pd.get_dummies(df, columns=['Gender'], dtype=int)
print(df_encoded)

# Label encoding
df['Gender_Code'] = df['Gender'].cat.codes
print(df)
```
**Output (One-Hot Encoding)**:
```
      Name         City  Gender_Male  Gender_Female
0    Alice     New York            0              1
1      Bob  Los Angeles            1              0
2  Charlie      Chicago            1              0
3    David      Houston            1              0
```
```
      Name  Gender         City  Gender_Code
0    Alice  Female     New York            1
1      Bob    Male  Los Angeles            0
2  Charlie    Male      Chicago            0
3    David    Male      Houston            0
```

### 6. Handling Missing Values in Categorical Data
• **Definition**: You can handle missing values in categorical data using methods like `fillna()` or `replace()`.
• **Method**:
  - Use `fillna()` to replace missing values.

**Example**:
```python
# Introducing missing values
df.loc[1, 'Gender'] = None

# 'Unknown' must be registered as a category before it can be used as a fill value
df['Gender'] = df['Gender'].cat.add_categories('Unknown').fillna('Unknown')
print(df)
```
**Output**:
```
      Name   Gender         City  Gender_Code
0    Alice   Female     New York            1
1      Bob  Unknown  Los Angeles            0
2  Charlie     Male      Chicago            0
3    David     Male      Houston            0
```

### 7. Plotting Categorical Data
• **Definition**: You can visualize categorical data using bar plots or count plots.
• **Method**:
  - Use Matplotlib or Seaborn for visualization.

**Example**:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Bar plot of gender counts
sns.countplot(x='Gender', data=df)
plt.title('Count of Genders')
plt.show()
```
**Output**: A bar plot showing the count of each gender category.

---
---
---

### **Working with JSON and HTML Data**

* Pandas provides robust functionality for working with JSON and HTML data, allowing you to easily read, manipulate, and analyze data from these formats.

### Working with JSON Data

#### 1. Importing JSON Data
• **Definition**: JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy to read and write for humans and machines.
• **Method**:
  - Use `pd.read_json()` to read JSON data into a DataFrame.

**Example**:
```python
from io import StringIO

import pandas as pd

# Sample JSON data as a string
json_data = '''
[
    {"Name": "Alice", "Age": 24, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 22, "City": "Chicago"}
]
'''

# Importing JSON data (wrapped in StringIO because passing a raw JSON string is deprecated)
df_json = pd.read_json(StringIO(json_data))
print(df_json)
```
**Output**:
```
      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
```

#### 2. Importing JSON from a File
• **Definition**: You can also read JSON data from a file.
• **Method**:
  - Use `pd.read_json()` with a file path.

**Example**:
```python
# Assuming 'data.json' contains valid JSON data
df_json_file = pd.read_json('data.json')
print(df_json_file)
```

#### 3. Exporting Data to JSON
• **Definition**: You can save a DataFrame to a JSON file.
• **Method**:
  - Use `DataFrame.to_json()` to export DataFrames to JSON format.

**Example**:
```python
# Exporting DataFrame to a JSON file
df_json.to_json('output.json', orient='records', lines=True)
```
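
With `orient='records'` and `lines=True`, each row becomes one JSON object on its own line (the JSON Lines layout), and the same `lines=True` flag is needed to read it back. A small round-trip sketch using an illustrative `people` frame and no files:

```python
from io import StringIO

import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 30]})

# With no path argument, to_json returns the JSON text instead of writing a file
json_lines = people.to_json(orient='records', lines=True)
print(json_lines)

# lines=True tells read_json to expect one JSON object per line
restored = pd.read_json(StringIO(json_lines), lines=True)
print(restored)
```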

### Working with HTML Data

#### 1. Importing HTML Data
• **Definition**: HTML (HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser. Pandas can read tables from HTML pages.
• **Method**:
  - Use `pd.read_html()` to read HTML tables into a list of DataFrames. It needs an HTML parser to be installed: `lxml`, or `beautifulsoup4` together with `html5lib`.

**Example**:
```python
# Importing HTML data from a URL
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
dfs = pd.read_html(url)

# Displaying the first DataFrame (the first table on the page)
df_html = dfs[0]
print(df_html.head())
```
**Output**: The first few rows of the DataFrame containing population data.

#### 2. Importing HTML from a Local File
• **Definition**: You can also read HTML tables from a local HTML file.
• **Method**:
  - Use `pd.read_html()` with a file path.

**Example**:
```python
# Assuming 'data.html' contains valid HTML tables
dfs_local = pd.read_html('data.html')

# Displaying the first DataFrame
df_html_local = dfs_local[0]
print(df_html_local.head())
```

#### 3. Exporting Data to HTML
• **Definition**: You can save a DataFrame to an HTML file.
• **Method**:
  - Use `DataFrame.to_html()` to export DataFrames to HTML format.

**Example**:
```python
# Exporting DataFrame to an HTML file
df_json.to_html('output.html', index=False)
```

### 4. Handling Nested JSON Data
• **Definition**: JSON data can often be nested, requiring additional processing to flatten it into a DataFrame.
• **Method**:
  - Use the `json_normalize()` function from Pandas.

**Example**:
```python
import json

# Sample nested JSON data
nested_json = '''
[
    {"Name": "Alice", "Info": {"Age": 24, "City": "New York"}},
    {"Name": "Bob", "Info": {"Age": 30, "City": "Los Angeles"}},
    {"Name": "Charlie", "Info": {"Age": 22, "City": "Chicago"}}
]
'''

# Parsing the JSON text into a list of dictionaries
records = json.loads(nested_json)

# Normalizing the nested JSON: nested keys become dotted column names
df_normalized = pd.json_normalize(records)
print(df_normalized)
```
**Output**:
```
      Name  Info.Age    Info.City
0    Alice        24     New York
1      Bob        30  Los Angeles
2  Charlie        22      Chicago
```
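
Nested JSON often contains lists of records as well as nested objects. As a sketch, the `record_path` and `meta` arguments of `pd.json_normalize()` expand such a list into rows while carrying parent fields along; the `customers` data below is hypothetical.

```python
import pandas as pd

# Hypothetical records: each customer has a nested list of orders
customers = [
    {'Name': 'Alice', 'Orders': [{'Item': 'Book', 'Price': 12}, {'Item': 'Pen', 'Price': 2}]},
    {'Name': 'Bob', 'Orders': [{'Item': 'Lamp', 'Price': 30}]},
]

# record_path selects the nested list to expand into rows;
# meta copies the listed parent fields onto every expanded row
orders = pd.json_normalize(customers, record_path='Orders', meta=['Name'])
print(orders)
```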

---
---
---

### **Working with Large Datasets**

* Working with large datasets in Pandas can be challenging due to memory constraints and performance issues. However, Pandas provides several techniques and best practices to efficiently handle large datasets.

### 1. Reading Large Datasets

#### 1.1 Using `chunksize`
• **Definition**: When reading large files, you can use the `chunksize` parameter to read the data in smaller chunks, which helps manage memory usage.
• **Method**:
  - Use `pd.read_csv()` or other read functions with the `chunksize` parameter.

**Example**:
```python
import pandas as pd

# Reading a large CSV file in chunks
chunk_size = 10000  # Number of rows per chunk
chunks = pd.read_csv('large_data.csv', chunksize=chunk_size)

# Processing each chunk
for chunk in chunks:
    # Perform operations on each chunk
    print(chunk.head())
```
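
Chunked reading pays off when each chunk is reduced to a small running result instead of being kept in memory. A minimal sketch, assuming a single `value` column and using an in-memory CSV as a stand-in for a large file:

```python
from io import StringIO

import pandas as pd

# Small in-memory CSV standing in for a large file (values 1 to 100)
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 101))

total = 0
row_count = 0
for chunk in pd.read_csv(StringIO(csv_text), chunksize=25):
    # Keep only running totals; each chunk can be discarded afterwards
    total += chunk['value'].sum()
    row_count += len(chunk)

print(f"Mean over {row_count} rows: {total / row_count}")
```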

### 2. Optimizing Data Types
• **Definition**: Using appropriate data types can significantly reduce memory usage. For example, using `float32` instead of `float64` or `category` for categorical data.
• **Method**:
  - Use `astype()` to convert data types.

**Example**:
```python
# Sample DataFrame with 3 million rows
data = {
    'A': [1.0, 2.0, 3.0] * 10**6,
    'B': ['cat', 'dog', 'bird'] * 10**6
}
df = pd.DataFrame(data)

# Optimizing data types
df['A'] = df['A'].astype('float32')  # Change to float32
df['B'] = df['B'].astype('category')  # Change to category
df.info(memory_usage='deep')  # info() prints directly and returns None
```
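
Converting after loading means the full-size frame exists in memory first. One way around this, sketched below with made-up column names, is to pass the target types to `read_csv()` through its `dtype` parameter so the columns are created in compact form.

```python
from io import StringIO

import pandas as pd

# In-memory CSV standing in for a file on disk
csv_text = "id,price,animal\n1,1.5,cat\n2,2.5,dog\n3,3.5,cat\n"

typed = pd.read_csv(
    StringIO(csv_text),
    dtype={'id': 'int32', 'price': 'float32', 'animal': 'category'},
)
print(typed.dtypes)
```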

### 3. Using `dask` for Out-of-Core Computation
• **Definition**: Dask is a parallel computing library that integrates with Pandas and allows you to work with larger-than-memory datasets by breaking them into smaller chunks.
• **Method**:
  - Use `dask.dataframe` to create a Dask DataFrame.

**Example**:
```python
import dask.dataframe as dd

# Reading a large CSV file with Dask
ddf = dd.read_csv('large_data.csv')

# Perform operations on Dask DataFrame
result = ddf.groupby('column_name').mean().compute()  # Use compute() to get the result
print(result)
```

### 4. Filtering Data Efficiently
• **Definition**: When working with large datasets, filtering data efficiently can help reduce the size of the DataFrame and improve performance.
• **Method**:
  - Use boolean indexing to filter rows.

**Example**:
```python
# Filtering rows based on a condition
filtered_df = df[df['A'] > 1.5]
print(filtered_df)
```

### 5. Using `query()` for Efficient Filtering
• **Definition**: The `query()` method filters a DataFrame using an expression string. On large DataFrames it can be faster than boolean indexing when the optional `numexpr` package is installed.
• **Method**:
  - Use `query()` to filter rows based on conditions.

**Example**:
```python
# Using query to filter data
filtered_query = df.query('A > 1.5')
print(filtered_query)
```

### 6. Aggregating Data Efficiently
• **Definition**: When working with large datasets, aggregating data can help summarize information and reduce the size of the DataFrame.
• **Method**:
  - Use `groupby()` and aggregation functions.

**Example**:
```python
# Aggregating data
aggregated_data = df.groupby('B', observed=True).mean()  # observed=True uses only categories present in the data
print(aggregated_data)
```

### 7. Writing Large Datasets
• **Definition**: When exporting large datasets, you can use the `chunksize` parameter to write data in smaller chunks, which helps manage memory usage.
• **Method**:
  - Use `to_csv()` or other write functions with the `chunksize` parameter.

**Example**:
```python
# Writing a large DataFrame to CSV in chunks
df.to_csv('output_large_data.csv', index=False, chunksize=10000)
```

### 8. Profiling Memory Usage
• **Definition**: Profiling memory usage helps identify bottlenecks and optimize performance when working with large datasets.
• **Method**:
  - Use the `memory_usage()` method to check memory consumption.

**Example**:
```python
# Checking memory usage
print(df.memory_usage(deep=True))
```

---
---
---

### **DataFrame Transformations**

* DataFrame transformations in Pandas refer to the various methods and techniques used to modify, reshape, or manipulate data within a DataFrame.
* These transformations can include operations such as filtering, aggregating, pivoting, melting, and applying functions.

### 1. Filtering Data
• **Definition**: Filtering allows you to select specific rows from a DataFrame based on certain conditions.
• **Method**:
  - Use boolean indexing to filter rows.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 30, 22, 35],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)

# Filtering rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
```
**Output**:
```
    Name  Age         City
1    Bob   30  Los Angeles
3  David   35      Houston
```

### 2. Adding New Columns
• **Definition**: You can add new columns to a DataFrame based on existing data or calculations.
• **Method**:
  - Assign a new Series or calculation to a new column name.

**Example**:
```python
# Adding a new column for age in months
df['Age_in_Months'] = df['Age'] * 12
print(df)
```
**Output**:
```
      Name  Age         City  Age_in_Months
0    Alice   24     New York             288
1      Bob   30  Los Angeles             360
2  Charlie   22      Chicago             264
3    David   35      Houston             420
```

### 3. Dropping Columns
• **Definition**: You can remove columns from a DataFrame that are no longer needed.
• **Method**:
  - Use `drop()` to remove specified columns.

**Example**:
```python
# Dropping the 'City' column
df_dropped = df.drop(columns=['City'])
print(df_dropped)
```
**Output**:
```
      Name  Age  Age_in_Months
0    Alice   24             288
1      Bob   30             360
2  Charlie   22             264
3    David   35             420
```

### 4. Renaming Columns
• **Definition**: You can rename columns in a DataFrame for better clarity or consistency.
• **Method**:
  - Use `rename()` to change column names.

**Example**:
```python
# Renaming columns
df_renamed = df.rename(columns={'Age': 'Years', 'City': 'Location'})
print(df_renamed)
```
**Output**:
```
      Name  Years     Location  Age_in_Months
0    Alice     24     New York             288
1      Bob     30  Los Angeles             360
2  Charlie     22      Chicago             264
3    David     35      Houston             420
```

### 5. Aggregating Data
• **Definition**: Aggregation involves summarizing data, such as calculating the mean, sum, or count for groups of data.
• **Method**:
  - Use `groupby()` followed by an aggregation function.

**Example**:
```python
# Aggregating data by city
data_agg = df.groupby('City')['Age'].mean()
print(data_agg)
```
**Output**:
```
City
Chicago        22.0
Houston        35.0
Los Angeles    30.0
New York       24.0
Name: Age, dtype: float64
```
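
Several summaries can be computed in one pass with named aggregation, where each output column is defined as a `(column, function)` pair. The `sales` frame and the output names below are assumptions made for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'Houston', 'Houston'],
    'Amount': [10, 30, 5, 15],
})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = sales.groupby('City').agg(
    total=('Amount', 'sum'),
    average=('Amount', 'mean'),
    orders=('Amount', 'count'),
)
print(summary)
```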

### 6. Pivoting Data
• **Definition**: Pivoting allows you to reshape data by turning unique values from one column into separate columns.
• **Method**:
  - Use `pivot_table()` to create a pivot table.

**Example**:
```python
# Sample DataFrame for pivoting
data_pivot = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40]
}
df_pivot = pd.DataFrame(data_pivot)

# Creating a pivot table
pivot_table = df_pivot.pivot_table(values='Values', index='Date', columns='Category', aggfunc='sum', fill_value=0)
print(pivot_table)
```
**Output**:
```
Category     A   B
Date              
2023-01-01  10  20
2023-01-02  30  40
```

### 7. Melting Data
• **Definition**: Melting transforms a DataFrame from wide format to long format, unpivoting the DataFrame.
• **Method**:
  - Use `melt()` to reshape the DataFrame.

**Example**:
```python
# Melting the pivot table back to long format
melted_df = pd.melt(pivot_table.reset_index(), id_vars='Date', value_vars=['A', 'B'], value_name='Values')
print(melted_df)
```
**Output**:
```
         Date Category  Values
0  2023-01-01        A      10
1  2023-01-02        A      30
2  2023-01-01        B      20
3  2023-01-02        B      40
```

### 8. Applying Functions
• **Definition**: You can apply custom functions to DataFrames or Series using `apply()` or `map()`.
• **Method**:
  - Use `apply()` for row-wise or column-wise operations.

**Example**:
```python
# Applying a function to calculate the square of values in 'Values' column
df_pivot['Values_Squared'] = df_pivot['Values'].apply(lambda x: x ** 2)
print(df_pivot)
```
**Output**:
```
         Date Category  Values  Values_Squared
0  2023-01-01        A      10             100
1  2023-01-01        B      20             400
2  2023-01-02        A      30             900
3  2023-01-02        B      40            1600
```

### 9. Sorting Data
• **Definition**: You can sort a DataFrame by one or more columns.
• **Method**:
  - Use `sort_values()` to sort the DataFrame.

**Example**:
```python
# Sorting by 'Age'
df_sorted = df.sort_values(by='Age', ascending=False)
print(df_sorted)
```
**Output**:
```
      Name  Age         City
3    David   35      Houston
1      Bob   30  Los Angeles
0    Alice   24     New York
2  Charlie   22      Chicago
```
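
Sorting by more than one column works the same way, with lists passed to `by` and `ascending`. This is a small sketch on a hypothetical `people` frame (not the `df` used above).

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'City': ['Chicago', 'Boston', 'Chicago', 'Boston'],
    'Age': [24, 30, 22, 35],
})

# Sort by City (ascending), then by Age (descending) within each city
sorted_people = people.sort_values(by=['City', 'Age'], ascending=[True, False])
print(sorted_people)
```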

---
---
---

### **Handling Outliers**

Handling outliers in Pandas is an important aspect of data preprocessing and cleaning. Outliers can skew your analysis and lead to misleading results, so it's essential to identify and manage them appropriately. Below are key topics related to handling outliers in Pandas, along with definitions, use cases, and examples.

### 1. Identifying Outliers
• **Definition**: Outliers are data points that differ significantly from other observations. They can be identified using statistical methods such as the Z-score or the Interquartile Range (IQR).
• **Method**:
  - Use Z-score or IQR to detect outliers.

**Example using IQR**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Values': [10, 12, 12, 13, 12, 14, 15, 100, 12, 13, 14, 15]
}
df = pd.DataFrame(data)

# Calculating IQR
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1

# Defining bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identifying outliers
outliers = df[(df['Values'] < lower_bound) | (df['Values'] > upper_bound)]
print("Outliers:")
print(outliers)
```
**Output**:
```
Outliers:
   Values
7     100
```
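
The Z-score method mentioned in the definition can be sketched in the same way. The cutoff of 3 used here is a common convention and an assumption of this example, not a fixed rule.

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 14, 15, 100, 12, 13, 14, 15], name='Values')

# Z-score: distance from the mean, measured in standard deviations
z_scores = (values - values.mean()) / values.std()  # std() uses ddof=1 by default

# Flag points whose absolute Z-score exceeds the assumed cutoff of 3
z_outliers = values[z_scores.abs() > 3]
print(z_outliers)
```

Because the mean and standard deviation are themselves pulled toward extreme values, the Z-score can miss outliers in small samples; the IQR approach is less sensitive to this.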

### 2. Visualizing Outliers
• **Definition**: Visualizing data can help identify outliers more intuitively.
• **Method**:
  - Use box plots or scatter plots to visualize outliers.

**Example**:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot to visualize outliers
sns.boxplot(x=df['Values'])
plt.title('Box Plot of Values')
plt.show()
```
**Output**: A box plot showing the distribution of values and highlighting outliers.

### 3. Handling Outliers
• **Definition**: Once identified, you can handle outliers in several ways, including removal, capping, or transformation.
• **Methods**:
  - **Removal**: Simply drop the outlier rows.
  - **Capping**: Replace outliers with a specified threshold.
  - **Transformation**: Apply transformations to reduce the impact of outliers.

**Example of Removal**:
```python
# Removing outliers
df_no_outliers = df[(df['Values'] >= lower_bound) & (df['Values'] <= upper_bound)]
print("DataFrame after removing outliers:")
print(df_no_outliers)
```
**Output**:
```
DataFrame after removing outliers:
    Values
0       10
1       12
2       12
3       13
4       12
5       14
6       15
8       12
9       13
10      14
11      15
```

**Example of Capping**:
```python
# Capping outliers
df['Capped_Values'] = df['Values'].clip(lower=lower_bound, upper=upper_bound)
print("DataFrame after capping outliers:")
print(df)
```
**Output**:
```
DataFrame after capping outliers:
    Values  Capped_Values
0       10         10.000
1       12         12.000
2       12         12.000
3       13         13.000
4       12         12.000
5       14         14.000
6       15         15.000
7      100         17.625
8       12         12.000
9       13         13.000
10      14         14.000
11      15         15.000
```

### 4. Transforming Data
• **Definition**: Applying transformations such as logarithmic or square root transformations can help reduce the impact of outliers.
• **Method**:
  - Use mathematical transformations to modify the data.

**Example**:
```python
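# Applying a logarithmic transformation (non-positive values are mapped to 0 here)
import numpy as np

df['Log_Values'] = df['Values'].apply(lambda x: np.log(x) if x > 0 else 0)
print("DataFrame after logarithmic transformation:")
print(df[['Values', 'Log_Values']])  # show only the relevant columns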
```
**Output**:
```
DataFrame after logarithmic transformation:
    Values  Log_Values
0       10    2.302585
1       12    2.484907
2       12    2.484907
3       13    2.564949
4       12    2.484907
5       14    2.639057
6       15    2.708050
7      100    4.605170
8       12    2.484907
9       13    2.564949
10      14    2.639057
11      15    2.708050
```
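
The definition also mentions square root transformations. As a rough sketch, the vectorized NumPy functions can be applied to the whole column at once; `np.log1p` (log of 1 + x) is shown as an assumption-free alternative when zeros may occur. The column names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 14, 15, 100, 12, 13, 14, 15], name='Values')

transformed = pd.DataFrame({
    'Values': values,
    'Sqrt_Values': np.sqrt(values),    # square root: milder compression than log
    'Log1p_Values': np.log1p(values),  # log(1 + x): defined at x = 0
})
print(transformed.round(3))
```

Both transformations shrink the gap between the extreme value and the rest of the data without removing any rows.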

### 5. Summary Statistics Without Outliers
• **Definition**: You can calculate summary statistics while excluding outliers to get a better understanding of the central tendency and dispersion of the data.
• **Method**:
  - Use the `describe()` method on the DataFrame without outliers.

**Example**:
```python
# Summary statistics without outliers
summary_stats = df_no_outliers.describe()
print("Summary statistics without outliers:")
print(summary_stats)
```
**Output**:
```
Summary statistics without outliers:
          Values
count  11.000000
mean   12.909091
std     1.513575
min    10.000000
25%    12.000000
50%    13.000000
75%    14.000000
max    15.000000
```

---
---
---

### **Pivot and Unpivot Data**

* Pivoting and unpivoting (or melting) data in Pandas are essential techniques for reshaping data to facilitate analysis and visualization.
* These operations allow you to transform data from a wide format to a long format and vice versa.

### 1. Pivoting Data
• **Definition**: Pivoting transforms a DataFrame from a long format to a wide format by turning unique values from one column into separate columns.
• **Method**:
  - Use `pivot()` or `pivot_table()` to create a pivot table.

#### 1.1 Using `pivot()`
• **Definition**: The `pivot()` method is used to reshape data based on unique values from specified columns.
• **Method**:
  - `DataFrame.pivot(index, columns, values)`: Reshapes the DataFrame.

**Example**:
```python
import pandas as pd

# Sample DataFrame
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

# Creating a pivot table
pivot_table = df.pivot(index='Date', columns='Category', values='Values')
print(pivot_table)
```
**Output**:
```
Category         A   B
Date                  
2023-01-01     10  20
2023-01-02     30  40
```

#### 1.2 Using `pivot_table()`
• **Definition**: The `pivot_table()` method is more flexible than `pivot()` and allows for aggregation of values.
• **Method**:
  - `DataFrame.pivot_table(index, columns, values, aggfunc)`: Creates a pivot table with aggregation.

**Example**:
```python
# Sample DataFrame (same data as above; each Date/Category pair appears once, so 'sum' leaves the values unchanged)
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

# Creating a pivot table with aggregation
pivot_table_agg = df.pivot_table(index='Date', columns='Category', values='Values', aggfunc='sum', fill_value=0)
print(pivot_table_agg)
```
**Output**:
```
Category         A   B
Date                  
2023-01-01     10  20
2023-01-02     30  40
```
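
The difference between the two methods becomes visible once an index/column pair occurs more than once. The sketch below uses a hypothetical frame `dup` with a repeated `(Date, Category)` pair: `pivot()` refuses to reshape it, while `pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data with a repeated (Date, Category) pair
dup = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
    'Category': ['A', 'A', 'B', 'A'],
    'Values': [10, 5, 20, 30],
})

# pivot() cannot decide which of the duplicate values to keep
try:
    dup.pivot(index='Date', columns='Category', values='Values')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print("pivot() failed:", exc)

# pivot_table() combines duplicates with aggfunc and fills empty cells
agg = dup.pivot_table(index='Date', columns='Category', values='Values',
                      aggfunc='sum', fill_value=0)
print(agg)
```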

### 2. Unpivoting Data (Melting)
• **Definition**: Unpivoting (or melting) transforms a DataFrame from a wide format to a long format by turning columns into rows.
• **Method**:
  - Use `melt()` to reshape the DataFrame.

**Example**:
```python
# Sample pivot table DataFrame
pivot_data = {
    'Date': ['2023-01-01', '2023-01-02'],
    'A': [10, 30],
    'B': [20, 40]
}
df_pivot = pd.DataFrame(pivot_data)

# Melting the DataFrame back to long format
melted_df = pd.melt(df_pivot, id_vars='Date', value_vars=['A', 'B'], var_name='Category', value_name='Values')
print(melted_df)
```
**Output**:
```
         Date Category  Values
0  2023-01-01        A      10
1  2023-01-02        A      30
2  2023-01-01        B      20
3  2023-01-02        B      40
```

### 3. Using `melt()` with Additional Parameters
• **Definition**: You can specify additional parameters in `melt()` to customize the transformation.
• **Method**:
  - Use `var_name` and `value_name` to rename the resulting columns.

**Example**:
```python
# Melting with custom column names
melted_custom = pd.melt(df_pivot, id_vars='Date', var_name='Category', value_name='Value')
print(melted_custom)
```
**Output**:
```
         Date Category  Value
0  2023-01-01        A     10
1  2023-01-02        A     30
2  2023-01-01        B     20
3  2023-01-02        B     40
```
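
Melting and pivoting are inverse operations when each `Date`/`Category` pair is unique. A minimal round-trip sketch, using an illustrative frame named `wide`:

```python
import pandas as pd

wide = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02'], 'A': [10, 30], 'B': [20, 40]})

# Wide -> long
long = wide.melt(id_vars='Date', var_name='Category', value_name='Value')

# Long -> wide again
restored = (
    long.pivot(index='Date', columns='Category', values='Value')
        .reset_index()
        .rename_axis(None, axis=1)  # drop the leftover 'Category' label on the columns
)
print(restored)
```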

### 4. Pivoting with Multiple Indexes
• **Definition**: You can create a pivot table with multiple index levels for more complex data structures.
• **Method**:
  - Use lists for the `index` parameter in `pivot_table()`.

**Example**:
```python
# Sample DataFrame with multiple categories
data_multi = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Subcategory': ['X', 'Y', 'X', 'Y'],
    'Values': [10, 20, 30, 40]
}
df_multi = pd.DataFrame(data_multi)

# Creating a pivot table with multiple indexes
pivot_multi = df_multi.pivot_table(index=['Date', 'Category'], columns='Subcategory', values='Values', aggfunc='sum', fill_value=0)
print(pivot_multi)
```
**Output**:
```
Subcategory              X   Y
Date       Category          
2023-01-01 A         10   0
           B          0  20
2023-01-02 A         30   0
           B          0  40
```

---
---
---

### **Time Zone Handling**

Handling time zones in Pandas is crucial for working with time series data that spans multiple time zones or requires accurate time representation. Pandas provides robust functionality for localizing, converting, and manipulating time zone-aware datetime objects. Below are key topics related to time zone handling in Pandas, along with definitions, use cases, and examples.

### 1. Creating Time Zone-Aware Datetime Objects
• **Definition**: You can create datetime objects that are aware of time zones using the `pd.to_datetime()` function or by localizing naive datetime objects.
• **Method**:
  - Use `pd.to_datetime()` with the `utc` parameter or `tz_localize()` to set a time zone.

**Example**:
```python
import pandas as pd

# Creating a naive datetime object
naive_datetime = pd.to_datetime('2023-01-01 12:00:00')
print("Naive datetime:", naive_datetime)

# Localizing to UTC
utc_datetime = naive_datetime.tz_localize('UTC')
print("UTC datetime:", utc_datetime)
```
**Output**:
```
Naive datetime: 2023-01-01 12:00:00
UTC datetime: 2023-01-01 12:00:00+00:00
```
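
The `utc` parameter of `pd.to_datetime()` mentioned above is not used in the example, so here is a short sketch of it. With `utc=True`, strings that carry different UTC offsets are normalized to UTC, and naive strings are interpreted as UTC. The sample timestamps are illustrative.

```python
import pandas as pd

# Strings with different UTC offsets are converted to a single UTC index
stamps = pd.to_datetime(
    ['2023-01-01 12:00:00+00:00', '2023-01-01 12:00:00+05:30'],
    utc=True,
)
print(stamps)

# A naive string is treated as already being in UTC
aware = pd.to_datetime('2023-01-01 12:00:00', utc=True)
print(aware)
```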

### 2. Converting Time Zones
• **Definition**: You can convert time zone-aware datetime objects from one time zone to another using the `tz_convert()` method.
• **Method**:
  - Use `tz_convert()` to change the time zone of a datetime object.

**Example**:
```python
# Converting UTC to Eastern Time (US)
eastern_datetime = utc_datetime.tz_convert('America/New_York')
print("Eastern Time:", eastern_datetime)
```
**Output**:
```
Eastern Time: 2023-01-01 07:00:00-05:00
```

### 3. Creating Time Series with Time Zones
• **Definition**: You can create a time series with a specific time zone using `pd.date_range()` and the `tz` parameter.
• **Method**:
  - Use `pd.date_range()` with the `tz` parameter to create a time series.

**Example**:
```python
# Creating a time series with a specific time zone
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D', tz='UTC')
print(date_rng)
```
**Output**:
```
DatetimeIndex(['2023-01-01 00:00:00+00:00', '2023-01-02 00:00:00+00:00',
               '2023-01-03 00:00:00+00:00', '2023-01-04 00:00:00+00:00',
               '2023-01-05 00:00:00+00:00', '2023-01-06 00:00:00+00:00',
               '2023-01-07 00:00:00+00:00', '2023-01-08 00:00:00+00:00',
               '2023-01-09 00:00:00+00:00', '2023-01-10 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
```

### 4. Handling Daylight Saving Time (DST)
• **Definition**: Time zones may have daylight saving time adjustments, which can affect time calculations.
• **Method**:
  - Use time zone-aware datetime objects to handle DST automatically.

**Example**:
```python
# Creating a time series with DST
date_rng_dst = pd.date_range(start='2023-03-01', end='2023-03-15', freq='D', tz='America/New_York')
print(date_rng_dst)
```
**Output**:
```
DatetimeIndex(['2023-03-01 00:00:00-05:00', '2023-03-02 00:00:00-05:00',
               '2023-03-03 00:00:00-05:00', '2023-03-04 00:00:00-05:00',
               '2023-03-05 00:00:00-05:00', '2023-03-06 00:00:00-05:00',
               '2023-03-07 00:00:00-05:00', '2023-03-08 00:00:00-05:00',
               '2023-03-09 00:00:00-05:00', '2023-03-10 00:00:00-05:00',
               '2023-03-11 00:00:00-05:00', '2023-03-12 00:00:00-05:00',
               '2023-03-13 00:00:00-04:00', '2023-03-14 00:00:00-04:00',
               '2023-03-15 00:00:00-04:00'],
              dtype='datetime64[ns, America/New_York]', freq='D')
```
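
In 2023 the US switch to daylight saving time happened at 02:00 local time on March 12, which is why offsets change around that date. The sketch below shows three consequences: the day containing the transition is only 23 hours long, some local times do not exist, and in autumn some local times occur twice. The `nonexistent` and `ambiguous` arguments of `tz_localize()` control how such cases are resolved; the specific choices here are just one reasonable option.

```python
import pandas as pd

tz = 'America/New_York'

# The day that contains the spring-forward transition is only 23 hours long
elapsed = pd.Timestamp('2023-03-13', tz=tz) - pd.Timestamp('2023-03-12', tz=tz)
print(elapsed)

# 02:30 on 2023-03-12 does not exist locally; shift it to the next valid time
gap = pd.Timestamp('2023-03-12 02:30:00').tz_localize(tz, nonexistent='shift_forward')
print(gap)

# 01:30 on 2023-11-05 occurs twice; ambiguous=True picks the DST occurrence,
# ambiguous=False the standard-time occurrence
first = pd.Timestamp('2023-11-05 01:30:00').tz_localize(tz, ambiguous=True)
second = pd.Timestamp('2023-11-05 01:30:00').tz_localize(tz, ambiguous=False)
print(first, second)
```

Without these arguments, `tz_localize()` raises an error for nonexistent or ambiguous local times rather than guessing.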

### 5. Converting Between Time Zones
• **Definition**: You can convert an entire time series from one time zone to another. `tz_convert()` keeps each underlying instant unchanged; only the wall-clock representation (and the UTC offset) changes.
• **Method**:
  - Use `tz_convert()` to change the time zone.

**Example**:
```python
# Converting from Eastern Time to Pacific Time
pacific_time = date_rng_dst.tz_convert('America/Los_Angeles')
print(pacific_time)
```
**Output**:
```
DatetimeIndex(['2023-02-28 21:00:00-08:00', '2023-03-01 21:00:00-08:00',
               '2023-03-02 21:00:00-08:00', '2023-03-03 21:00:00-08:00',
               '2023-03-04 21:00:00-08:00', '2023-03-05 21:00:00-08:00',
               '2023-03-06 21:00:00-08:00', '2023-03-07 21:00:00-08:00',
               '2023-03-08 21:00:00-08:00', '2023-03-09 21:00:00-08:00',
               '2023-03-10 21:00:00-08:00', '2023-03-11 21:00:00-08:00',
               '2023-03-12 21:00:00-07:00', '2023-03-13 21:00:00-07:00',
               '2023-03-14 21:00:00-07:00'],
              dtype='datetime64[ns, America/Los_Angeles]', freq='D')
```
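
It can help to contrast `tz_convert()` with re-localizing. In this sketch (variable names are illustrative), `tz_convert()` keeps the same instants and changes the displayed clock time, while dropping the zone with `tz_localize(None)` and localizing again keeps the clock time and therefore changes the instants.

```python
import pandas as pd

eastern = pd.date_range('2023-03-01', periods=3, freq='D', tz='America/New_York')

# tz_convert: same instants, new wall-clock representation
pacific = eastern.tz_convert('America/Los_Angeles')
print(pacific)

# tz_localize(None) drops the zone; localizing again keeps the wall-clock time
same_wall_clock = eastern.tz_localize(None).tz_localize('America/Los_Angeles')
print(same_wall_clock)
```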

### 6. Localizing Naive Datetime Objects
• **Definition**: You can convert naive datetime objects (without time zone information) to time zone-aware datetime objects.
• **Method**:
  - Use `tz_localize()` to set the time zone.

**Example**:
```python
# Creating a naive datetime object
naive_datetime = pd.to_datetime('2023-01-01 12:00:00')

# Localizing to a specific time zone
localized_datetime = naive_datetime.tz_localize('UTC')
print(localized_datetime)
```
**Output**:
```
2023-01-01 12:00:00+00:00
```

### 7. Handling Time Zone-Aware DataFrames
• **Definition**: You can create DataFrames with time zone-aware datetime indices, allowing for easier manipulation of time series data.
• **Method**:
  - Use `pd.date_range()` with the `tz` parameter to create a time zone-aware index.

**Example**:
```python
# Creating a DataFrame with a time zone-aware index
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D', tz='UTC')
df_time_zone = pd.DataFrame({'data': range(len(date_rng))}, index=date_rng)
print(df_time_zone)
```
**Output**:
```
                           data
2023-01-01 00:00:00+00:00     0
2023-01-02 00:00:00+00:00     1
2023-01-03 00:00:00+00:00     2
2023-01-04 00:00:00+00:00     3
2023-01-05 00:00:00+00:00     4
2023-01-06 00:00:00+00:00     5
2023-01-07 00:00:00+00:00     6
2023-01-08 00:00:00+00:00     7
2023-01-09 00:00:00+00:00     8
2023-01-10 00:00:00+00:00     9
```

---
---
---

### **Sparse Data Structures**

Sparse data structures in Pandas are designed to efficiently store and manipulate data in which most values are identical, typically missing (`NaN`) or zero. This is particularly useful where memory efficiency is crucial, such as in large datasets with many missing entries. Since pandas 1.0, the former `SparseDataFrame` and `SparseSeries` classes no longer exist; sparse data is stored in ordinary `Series` and `DataFrame` objects whose columns use a sparse dtype (`pd.SparseDtype`, backed by `pd.arrays.SparseArray`), and sparse-specific functionality is exposed through the `.sparse` accessor. Below are key topics related to sparse data structures in Pandas, along with definitions, use cases, and examples.

### 1. Introduction to Sparse Data
• **Definition**: Sparse data refers to datasets where most of the elements are zero or missing. Storing such data in a dense format can lead to inefficient memory usage.
• **Use Cases**: Sparse data structures are commonly used in fields like natural language processing (NLP), recommendation systems, and any domain where data is often missing or zero.

### 2. Creating Sparse DataFrames
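• **Definition**: You can create a sparse DataFrame by giving its columns a sparse dtype. The older `pd.SparseDataFrame` class was removed in pandas 1.0, and `pd.DataFrame()` has no `sparse=True` parameter.
• **Method**:
  - Use `pd.DataFrame()` with `dtype='Sparse[int]'`, or convert an existing DataFrame with `astype()`. For `Sparse[int]` the fill value is `0`, so zeros are not stored explicitly, although the printed DataFrame still shows them as `0`.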

**Example**:
```python
import pandas as pd
import numpy as np

# Creating a dense DataFrame
data = {
    'A': [1, 0, 0, 4],
    'B': [0, 0, 3, 0],
    'C': [0, 2, 0, 0]
}
df_dense = pd.DataFrame(data)

# Creating a Sparse DataFrame
df_sparse = pd.DataFrame(data, dtype='Sparse[int]')
print(df_sparse)
```
**Output**:
```
   A  B  C
0  1  0  0
1  0  0  2
2  0  3  0
3  4  0  0
```

### 3. Working with Sparse DataFrames
• **Definition**: Sparse DataFrames allow you to perform standard DataFrame operations while efficiently managing memory.
• **Method**:
  - Use standard DataFrame methods on sparse DataFrames.

**Example**:
```python
# Performing operations on Sparse DataFrame
df_sparse['D'] = df_sparse['A'] + df_sparse['B']
print(df_sparse)
```
**Output**:
```
   A  B  C  D
0  1  0  0  1
1  0  0  2  0
2  0  3  0  3
3  4  0  0  4
```

### 4. Converting to Sparse Format
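• **Definition**: You can convert an existing dense DataFrame to a sparse format with `astype()` and a sparse dtype. (The old `to_sparse()` method was removed in pandas 1.0.)
• **Method**:
  - Use `DataFrame.astype('Sparse[int]')`, or `astype(pd.SparseDtype(int, fill_value=0))` to set the fill value explicitly.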

**Example**:
```python
# Converting a dense DataFrame to sparse
df_dense_sparse = df_dense.astype('Sparse[int]')
print(df_dense_sparse)
```
**Output**:
```
   A  B  C
0  1  0  0
1  0  0  2
2  0  3  0
3  4  0  0
```

### 5. Sparse Series
• **Definition**: Similar to Sparse DataFrames, you can create Sparse Series to handle one-dimensional sparse data.
• **Method**:
  - Use `pd.Series()` with a sparse dtype such as `dtype='Sparse[int]'`.

**Example**:
```python
# Creating a Sparse Series
sparse_series = pd.Series([1, 0, 0, 4], dtype='Sparse[int]')
print(sparse_series)
```
**Output**:
```
0    1
1    0
2    0
3    4
dtype: Sparse[int64, 0]
```
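
Sparse-specific information is available through the `.sparse` accessor. The following sketch inspects the fill value, the number of explicitly stored points, and the density of the Series above, and converts it back to a dense Series.

```python
import pandas as pd

s = pd.Series([1, 0, 0, 4], dtype='Sparse[int]')

print(s.sparse.fill_value)  # the value that is not stored explicitly
print(s.sparse.npoints)     # number of explicitly stored values
print(s.sparse.density)     # stored values divided by total length

# Convert back to an ordinary (dense) Series
dense_again = s.sparse.to_dense()
print(dense_again.dtype)
```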

### 6. Memory Usage of Sparse DataFrames
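• **Definition**: Sparse DataFrames can significantly reduce memory usage compared to dense DataFrames, especially for large datasets in which most values equal the fill value (zero or missing). Each sparse column stores only its non-fill values plus their positions, so the saving grows with the share of fill values; on a tiny example it is modest.
• **Method**:
  - Use `memory_usage()` to compare memory usage. Exact byte counts vary with the pandas version and platform.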

**Example**:
```python
# Checking memory usage
print("Dense DataFrame memory usage:")
print(df_dense.memory_usage(deep=True))

print("Sparse DataFrame memory usage:")
print(df_sparse.memory_usage(deep=True))
```
**Output**:
```
Dense DataFrame memory usage:
Index    128
A         32
B         32
C         32
dtype: int64
Sparse DataFrame memory usage:
Index    128
A         24
B         12
C         12
D         36
dtype: int64
```
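
The effect is much clearer on larger data. This sketch builds a hypothetical Series of 100,000 integers in which about 1% of the entries are non-zero and compares the memory of the dense and sparse versions; the sizes and the seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Mostly zeros: 1% of the positions receive a non-zero value
values = np.zeros(n, dtype='int64')
nonzero_pos = rng.choice(n, size=n // 100, replace=False)
values[nonzero_pos] = rng.integers(1, 10, size=n // 100)

dense = pd.Series(values)
sparse = dense.astype(pd.SparseDtype('int64', fill_value=0))

# index=False excludes the index so only the data is compared
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```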

### 7. Performance Considerations
• **Definition**: While sparse data structures can save memory, they may have performance trade-offs for certain operations compared to dense structures.
• **Method**:
  - Use sparse structures when you have a significant number of missing values, but be aware of potential performance impacts.

---
---
---

### **Advanced Input/Output Functions**

Pandas provides a variety of advanced input/output (I/O) functions that allow you to read from and write to different file formats and data sources. These functions enable you to work with structured data efficiently, whether it's from local files, databases, or web sources. Below are key topics related to advanced I/O functions in Pandas, along with definitions, use cases, and examples.

### 1. Reading and Writing CSV Files

#### 1.1 Reading CSV Files with Options
• **Definition**: You can read CSV files with various options to handle different formats and delimiters.
• **Method**:
  - Use `pd.read_csv()` with parameters like `sep`, `header`, `na_values`, and `dtype`.

**Example**:
```python
import pandas as pd

# Reading a CSV file with custom options
df_csv = pd.read_csv('data.csv', sep=',', header=0, na_values=['NA', 'N/A'], dtype={'column_name': 'float64'})
print(df_csv.head())
```
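
Since `data.csv` above is a placeholder, here is a self-contained sketch of the same options using an in-memory buffer. The content, the column names, and the extra `'missing'` marker are hypothetical. Note that `'NA'` and `'N/A'` are already treated as missing by default; `na_values` adds further markers.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited content standing in for a file on disk
raw = io.StringIO(
    "id;price;city\n"
    "1;9.5;Paris\n"
    "2;N/A;Rome\n"
    "3;NA;missing\n"
)

df_opts = pd.read_csv(
    raw,
    sep=';',                                    # non-default delimiter
    header=0,                                   # first line holds the column names
    na_values=['NA', 'N/A', 'missing'],         # strings to treat as missing
    dtype={'id': 'int64', 'price': 'float64'},  # enforce column types
)
print(df_opts)
print(df_opts.dtypes)
```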

#### 1.2 Writing CSV Files with Options
• **Definition**: You can write DataFrames to CSV files with options to customize the output format.
• **Method**:
  - Use `DataFrame.to_csv()` with parameters like `index`, `header`, and `na_rep`.

**Example**:
```python
# Writing DataFrame to a CSV file
df_csv.to_csv('output.csv', index=False, header=True, na_rep='Missing')
```

### 2. Reading and Writing Excel Files

#### 2.1 Reading Excel Files
• **Definition**: You can read Excel files with multiple sheets and specify which sheet to read.
• **Method**:
  - Use `pd.read_excel()` with the `sheet_name` parameter.

**Example**:
```python
# Reading an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel.head())
```

#### 2.2 Writing Excel Files
• **Definition**: You can write DataFrames to Excel files, including multiple sheets.
• **Method**:
  - Use `DataFrame.to_excel()` with the `sheet_name` parameter.

**Example**:
```python
# Writing DataFrame to an Excel file
df_excel.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
```

### 3. Reading and Writing JSON Files

#### 3.1 Reading JSON Files
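• **Definition**: You can read JSON files into a DataFrame. `pd.read_json()` expects reasonably regular, table-like JSON; nested records usually need to be flattened first, for example with `pd.json_normalize()`.
• **Method**:
  - Use `pd.read_json()` with parameters such as `orient` and `lines` to describe the layout of the file.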

**Example**:
```python
# Reading a JSON file
df_json = pd.read_json('data.json')
print(df_json.head())
```

#### 3.2 Writing JSON Files
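• **Definition**: You can write DataFrames to JSON format, specifying the orientation. With `orient='records'` and `lines=True`, each row is written as one JSON object per line (JSON Lines).
• **Method**:
  - Use `DataFrame.to_json()` with the `orient` parameter; `lines=True` requires `orient='records'`.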

**Example**:
```python
# Writing DataFrame to a JSON file
df_json.to_json('output.json', orient='records', lines=True)
```
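
For nested JSON, one possible approach is to flatten the records with `pd.json_normalize()` before working with them. The sketch below uses hypothetical records, then writes the flat result as JSON Lines to a string and reads it back.

```python
import io
import pandas as pd

# Hypothetical nested records, e.g. as returned by json.load()
records = [
    {'id': 1, 'user': {'name': 'Alice', 'city': 'New York'}},
    {'id': 2, 'user': {'name': 'Bob', 'city': 'Chicago'}},
]

# Nested keys become dotted column names such as 'user.name'
flat = pd.json_normalize(records)
print(flat)

# With no path given, to_json() returns the JSON text as a string
json_lines = flat.to_json(orient='records', lines=True)

# lines=True on the reading side parses one JSON object per line
round_trip = pd.read_json(io.StringIO(json_lines), lines=True)
print(round_trip)
```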

### 4. Reading and Writing SQL Databases

#### 4.1 Reading from SQL Databases
• **Definition**: You can read data from SQL databases using SQL queries.
• **Method**:
  - Use `pd.read_sql()` to execute a query and return a DataFrame.

**Example**:
```python
import sqlite3

# Connecting to a SQLite database
conn = sqlite3.connect('database.db')

# Reading data from a SQL table
df_sql = pd.read_sql('SELECT * FROM table_name', conn)
print(df_sql.head())

# Closing the connection
conn.close()
```

#### 4.2 Writing to SQL Databases
• **Definition**: You can write DataFrames to SQL tables.
• **Method**:
  - Use `DataFrame.to_sql()` to write data to a specified table.

**Example**:
```python
# Connecting to a SQLite database
conn = sqlite3.connect('database.db')

# Writing DataFrame to a SQL table
df_sql.to_sql('table_name', conn, if_exists='replace', index=False)

# Closing the connection
conn.close()
```
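
The two steps can be tried without a database file by using an in-memory SQLite database. In this sketch the table name `orders` and its columns are hypothetical; the query passes its value through `params` instead of formatting it into the SQL string.

```python
import sqlite3
import pandas as pd

orders = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [10.0, 25.5, 7.25]})

# ':memory:' creates a throwaway database that disappears when the connection closes
conn = sqlite3.connect(':memory:')
try:
    orders.to_sql('orders', conn, if_exists='replace', index=False)

    # '?' is SQLite's placeholder; the value is supplied separately via params
    large = pd.read_sql(
        'SELECT order_id, amount FROM orders WHERE amount > ?',
        conn,
        params=(9.0,),
    )
finally:
    conn.close()  # close the connection even if a step above fails

print(large)
```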

### 5. Reading and Writing HTML Data

#### 5.1 Reading HTML Tables
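• **Definition**: You can read tables from HTML pages into DataFrames. `pd.read_html()` returns a list with one DataFrame per table it finds, and it needs an HTML parser library such as `lxml` (or `beautifulsoup4` with `html5lib`) to be installed.
• **Method**:
  - Use `pd.read_html()` to extract tables from HTML.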

**Example**:
```python
# Reading HTML tables from a URL
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
dfs = pd.read_html(url)

# Displaying the first DataFrame (the first table on the page)
df_html = dfs[0]
print(df_html.head())
```

#### 5.2 Writing DataFrames to HTML
• **Definition**: You can write DataFrames to HTML format.
• **Method**:
  - Use `DataFrame.to_html()` to export DataFrames to HTML.

**Example**:
```python
# Writing DataFrame to an HTML file
df_html.to_html('output.html', index=False)
```

### 6. Handling Compression
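• **Definition**: You can read and write compressed files (e.g., gzip, zip, bz2, xz) directly using Pandas.
• **Method**:
  - Use the `compression` parameter in read/write functions. The default, `compression='infer'`, picks the codec from the file extension (for example `.gz`), so the explicit argument is only needed when the extension is missing or misleading.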

**Example**:
```python
# Reading a compressed CSV file
df_compressed = pd.read_csv('data.csv.gz', compression='gzip')
print(df_compressed.head())

# Writing a DataFrame to a compressed CSV file
df_compressed.to_csv('output.csv.gz', index=False, compression='gzip')
```
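
A round trip through a temporary directory shows compression being inferred from the file extension. This is only a sketch: the file name `sample.csv.gz` and the frame `df_small` are made up, and the check of the first two bytes relies on the gzip file signature `1f 8b`.

```python
import os
import tempfile
import pandas as pd

df_small = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv.gz')

    # No compression argument: the '.gz' extension selects gzip
    df_small.to_csv(path, index=False)

    # Read the first two bytes to confirm the file really is gzip-compressed
    with open(path, 'rb') as fh:
        magic = fh.read(2)

    # Reading also infers gzip from the extension
    df_back = pd.read_csv(path)

print("gzip signature found:", magic == b'\x1f\x8b')
print(df_back)
```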

### 7. Handling File Paths
• **Definition**: When importing or exporting files, you can specify relative or absolute file paths.
• **Method**:
  - Use appropriate file paths based on your working directory.

**Example**:
```python
# Importing from a relative path
df_relative = pd.read_csv('./data/data.csv')

# Exporting to an absolute path
df_relative.to_csv('/Users/username/Documents/output.csv', index=False)
```

---
---
---

### **Parallel Processing**

* Parallel processing in Pandas can significantly enhance performance when working with large datasets by utilizing multiple CPU cores to perform computations concurrently. While Pandas itself does not natively support parallel processing, there are several techniques and libraries that can be used to achieve parallelism.

### 1. Using Dask for Parallel Processing
• **Definition**: Dask is a flexible parallel computing library for analytics that integrates seamlessly with Pandas. It allows you to work with larger-than-memory datasets and perform computations in parallel.
• **Method**:
  - Use `dask.dataframe` to create Dask DataFrames that mimic Pandas DataFrames but operate in parallel.

**Example**:
```python
import dask.dataframe as dd

# Creating a Dask DataFrame from a CSV file
ddf = dd.read_csv('large_data.csv')

# Performing operations on Dask DataFrame
result = ddf.groupby('column_name').mean().compute()  # Use compute() to get the result
print(result)
```

### 2. Using Joblib for Parallel Processing
• **Definition**: Joblib is a library that provides tools for lightweight pipelining in Python. Its `Parallel` class and `delayed` helper can be used to run independent function calls in parallel.
• **Method**:
  - Use `joblib.Parallel` to execute functions in parallel.

**Example**:
```python
from joblib import Parallel, delayed
import pandas as pd

# Sample function to apply
def process_row(row):
    return row['A'] * 2  # Example operation

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Using Joblib to process rows in parallel
results = Parallel(n_jobs=-1)(delayed(process_row)(row) for index, row in df.iterrows())
df['Processed'] = results
print(df)
```
**Output**:
```
   A   B  Processed
0  1  10          2
1  2  20          4
2  3  30          6
3  4  40          8
4  5  50         10
```

### 3. Using Multiprocessing
• **Definition**: The `multiprocessing` module in Python allows you to create multiple processes, each running in its own Python interpreter. This can be used to parallelize operations on DataFrames.
• **Method**:
  - Use `multiprocessing.Pool` to distribute tasks across multiple processes.

**Example**:
```python
import pandas as pd
from multiprocessing import Pool

# Sample function to apply
def process_row(row):
    return row['A'] * 2  # Example operation

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

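# Using multiprocessing to process rows in parallel.
# The __main__ guard is needed where worker processes are started with "spawn"
# (Windows, macOS); there, process_row must also live in an importable module
# when the code is run from a notebook.
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(process_row, [row for index, row in df.iterrows()])

    df['Processed'] = results
    print(df)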
```
**Output**:
```
   A   B  Processed
0  1  10          2
1  2  20          4
2  3  30          6
3  4  40          8
4  5  50         10
```
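
Sending one row per task, as above, adds serialization overhead for every row. A common alternative is to split the DataFrame into a few chunks and run a vectorized function on each chunk. The sketch below illustrates only the chunking pattern: it uses `ThreadPoolExecutor` from the standard library instead of a process pool so that it runs unchanged in scripts and notebooks, and the names `process_chunk`, `df_big`, and `n_chunks` are hypothetical. For CPU-bound pure-Python work, a process pool would be the usual choice.

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def process_chunk(chunk):
    # Vectorized work on a whole chunk rather than one row at a time
    return chunk['A'] * 2

df_big = pd.DataFrame({'A': range(1, 11), 'B': range(10, 110, 10)})

n_chunks = 4                              # hypothetical; tune to the workload
chunk_size = -(-len(df_big) // n_chunks)  # ceiling division
chunks = [df_big.iloc[i:i + chunk_size] for i in range(0, len(df_big), chunk_size)]

# executor.map preserves the order of the chunks
with ThreadPoolExecutor(max_workers=n_chunks) as executor:
    parts = list(executor.map(process_chunk, chunks))

# Each part keeps its original index, so the concatenated result aligns with df_big
df_big['Processed'] = pd.concat(parts)
print(df_big)
```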

### 4. Using Modin for Parallel DataFrame Operations
• **Definition**: Modin is a library that provides a drop-in replacement for Pandas, allowing you to speed up your Pandas operations by using all available CPU cores.
• **Method**:
  - Install Modin and use it as a replacement for Pandas.

**Example**:
```python
# Install Modin using pip
# !pip install modin[ray]  # Uncomment to install

import modin.pandas as mpd

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df_modin = mpd.DataFrame(data)

# Performing operations in parallel
df_modin['Processed'] = df_modin['A'] * 2
print(df_modin)
```

### 5. Performance Considerations
• **Definition**: While parallel processing can significantly speed up computations, it may introduce overhead due to process management and data serialization. It's important to assess whether the performance gain justifies the overhead.
• **Method**:
  - Use profiling tools to measure performance and identify bottlenecks.

---
---
---

