# **Data Selection & Indexing**

In [2]:
import pandas as pd

## 5. **Boolean Indexing & Filtering Techniques**

Boolean indexing **filters rows** in a DataFrame or Series using **conditions**: you pass a Boolean mask (one `True`/`False` value per row) and pandas keeps the rows where the mask is `True`. It's one of the most used techniques in **real-world data analysis and data science workflows**, especially for **cleaning, filtering, and subsetting data**.

### 🔹 Topics We’ll Cover

1. Basic Boolean Indexing
2. Combining Multiple Conditions
3. Filtering with `.isin()`
4. Filtering with `.between()`
5. Filtering with string methods (`.str.contains()`, etc.)
6. The `.query()` method
7. Real-world scenarios


## 🔸 1. Basic Boolean Indexing

### ✅ **Syntax**

```python
df[condition]
```

In [3]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'London', 'Paris', 'Berlin']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,25,NY
1,Bob,30,London
2,Charlie,35,Paris
3,David,40,Berlin


In [4]:
df['Age'] > 30

0    False
1    False
2     True
3     True
Name: Age, dtype: bool

In [5]:
df[df['Age'] > 30]

Unnamed: 0,Name,Age,City
2,Charlie,35,Paris
3,David,40,Berlin
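
The same mask can also be passed to `.loc[]`, which filters rows and selects columns in one step. A minimal sketch on the sample data above (the variable names `mask` and `older` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'London', 'Paris', 'Berlin']
})

# Build the Boolean mask once so it can be inspected or reused
mask = df['Age'] > 30

# Keep the rows where the mask is True, restricted to two columns
older = df.loc[mask, ['Name', 'City']]
print(older)
```

Storing the mask in a variable is optional, but it tends to help readability once a filter is reused in several places.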


## 🔸 2. Combining Multiple Conditions

Use:

* `&` for AND
* `|` for OR
* `~` for NOT

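**Wrap each condition in parentheses!** `&` and `|` bind more tightly than comparison operators such as `>` and `==`, so an unparenthesized expression is grouped incorrectly. Also use `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`, which do not work element-wise on a Series.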
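
To see why the element-wise operators matter, here is a small sketch on the same sample data. It assumes nothing beyond pandas itself; `error_message` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'London', 'Paris', 'Berlin']
})

# Python's `and` needs a single True/False for each whole Series,
# which pandas refuses to guess, so this raises ValueError.
try:
    df[(df['Age'] > 30) and (df['City'] == 'Paris')]
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# `&` compares the two masks row by row instead
result = df[(df['Age'] > 30) & (df['City'] == 'Paris')]
print(result)
```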

In [6]:
df

Unnamed: 0,Name,Age,City
0,Alice,25,NY
1,Bob,30,London
2,Charlie,35,Paris
3,David,40,Berlin


In [7]:
# Age should be more than 30 and City should be Paris
df[(df['Age'] > 30) & (df['City'] == 'Paris')]

Unnamed: 0,Name,Age,City
2,Charlie,35,Paris


In [8]:
# City either London or NY
df[(df['City'] == 'London') | (df['City'] == 'NY')]

Unnamed: 0,Name,Age,City
0,Alice,25,NY
1,Bob,30,London


In [9]:
# All rows where City is not Paris
df[~(df['City'] == 'Paris')]

Unnamed: 0,Name,Age,City
0,Alice,25,NY
1,Bob,30,London
3,David,40,Berlin


## 🔸 3. Filtering with `.isin()`

Used for filtering rows that match **multiple values** in a column.

In [10]:
# Data with cities London and Paris
df[df['City'].isin(['London', 'Paris'])]

Unnamed: 0,Name,Age,City
1,Bob,30,London
2,Charlie,35,Paris


In [11]:
# Data except city London
df[~df['City'].isin(['London'])]

Unnamed: 0,Name,Age,City
0,Alice,25,NY
2,Charlie,35,Paris
3,David,40,Berlin
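
`.isin()` is essentially a shorthand for a chain of `==` comparisons joined with `|`. The sketch below checks that equivalence on the sample data; the list name `wanted` is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'London', 'Paris', 'Berlin']
})

wanted = ['London', 'NY']

# One call that scales to any number of values
mask_isin = df['City'].isin(wanted)

# The equivalent hand-written OR chain
mask_or = (df['City'] == 'London') | (df['City'] == 'NY')

# Both masks select exactly the same rows
same = mask_isin.equals(mask_or)
print(same)
print(df[mask_isin])
```

Because the list is ordinary Python data, it can be built at runtime (for example from a config file), which is harder to do with a fixed OR chain.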


## 🔸 4. Filtering with `.between()`
Check if values lie within a **range**. Both endpoints are included by default.

In [12]:
df[df['Age'].between(30, 40)]

Unnamed: 0,Name,Age,City
1,Bob,30,London
2,Charlie,35,Paris
3,David,40,Berlin


## 🔸 5. Filtering with String Methods

These are **vectorized operations** used with string columns.

| Method                          | Purpose            |
| ------------------------------- | ------------------ |
| `.str.contains()`               | Contains substring |
| `.str.startswith()`             | Starts with string |
| `.str.endswith()`               | Ends with string   |
| `.str.lower()` / `.str.upper()` | Case normalization |

In [13]:
df

Unnamed: 0,Name,Age,City
0,Alice,25,NY
1,Bob,30,London
2,Charlie,35,Paris
3,David,40,Berlin


In [15]:
df['City'].str.contains('on')

0    False
1     True
2    False
3    False
Name: City, dtype: bool

In [16]:
df[df['City'].str.contains('on')]

Unnamed: 0,Name,Age,City
1,Bob,30,London


In [17]:
df[df['Name'].str.startswith('A')]

Unnamed: 0,Name,Age,City
0,Alice,25,NY


In [18]:
df[df['City'].str.lower() == 'paris']

Unnamed: 0,Name,Age,City
2,Charlie,35,Paris
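
For substring matching, `.str.contains()` also has a `case` parameter, so lowercasing the column first is not always necessary. A short sketch on the sample data (the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'London', 'Paris', 'Berlin']
})

# case=False ignores upper/lower case when matching the substring
contains_on = df[df['City'].str.contains('ON', case=False)]

# .str.endswith() checks the end of each string
ends_with_n = df[df['City'].str.endswith('n')]

print(contains_on)
print(ends_with_n)
```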


### ⚠️ Important:

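* Pass `na=False` when a string column may contain missing values. Without it, the mask can contain `NaN` instead of `True`/`False`, and indexing the DataFrame with such a mask can raise an error. With `na=False`, missing values simply count as "no match".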


In [19]:
df['City'].str.contains('on', na=False)

0    False
1     True
2    False
3    False
Name: City, dtype: bool
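
The sample DataFrame has no missing cities, so the output above looks the same with or without `na=False`. The sketch below uses a small hypothetical DataFrame (`cities`, with one missing value) to show the case where it matters.

```python
import pandas as pd

# Hypothetical data: Eve's city is missing
cities = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve'],
    'City': ['NY', 'London', None]
})

# na=False makes the missing value count as "no match",
# so the mask contains only True/False and is safe to index with
mask = cities['City'].str.contains('on', na=False)
matches = cities[mask]

print(mask)
print(matches)
```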

## 🔸 6. `.query()` Method

An alternative, often more readable way to filter: the condition is written as a string expression that refers to columns by name.

### ✅ Syntax:

```python
df.query('Age > 30 and City == "Paris"')
```

In [20]:
df.query('City in ["London", "Berlin"]')

Unnamed: 0,Name,Age,City
1,Bob,30,London
3,David,40,Berlin


#### ✅ When to use:

* Complex filters
* External string construction (e.g., user-driven queries)
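
When the values in a filter come from Python variables, `.query()` can reference them with an `@` prefix instead of formatting them into the string. A minimal sketch, where `min_age` and `allowed_cities` are hypothetical variables standing in for user input:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'London', 'Paris', 'Berlin']
})

# Values that might come from a form, a config file, or a function argument
min_age = 30
allowed_cities = ['London', 'Berlin']

# @name looks up a local Python variable inside the query string
result = df.query('Age >= @min_age and City in @allowed_cities')
print(result)
```

Passing values through `@` keeps the query text fixed, which tends to be safer than building the whole expression from untrusted strings.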


## 🔸 7. Real-World Use Cases

| Task                                     | Code                                           |
| ---------------------------------------- | ---------------------------------------------- |
| ✅ Customers older than 60                | `df[df['Age'] > 60]`                           |
| ✅ Orders between \$500 and \$1000        | `df[df['Amount'].between(500, 1000)]`          |
| ✅ Users from certain cities              | `df[df['City'].isin(['Bangalore', 'Mumbai'])]` |
| ✅ Products containing "Pro"              | `df[df['ProductName'].str.contains("Pro")]`    |
| ✅ Customers whose email ends with ".org" | `df[df['Email'].str.endswith('.org')]`         |
| ✅ Use `.query()` to filter a large dataset | `df.query('Age < 30 and City == "NY"')`        |


## 🧠 Performance Tip

✅ **Boolean indexing is fast and optimized** in pandas.
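For **large datasets**, `.query()` can sometimes be **faster** as well as more **readable**, because pandas can evaluate the expression with the optional `numexpr` library when it is installed. On small DataFrames, plain Boolean indexing is usually at least as fast.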

---

## 🧪 Practice Scenarios (Optional)

* Get all students with scores above average
* Select transactions in the last 7 days
* Filter users who joined in 2024 and are from "India"
* Extract product names with "Plus", "Pro", or "Ultra" in them
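
One possible way to approach these scenarios is sketched below. All DataFrames, column names, and the fixed reference date `today` are made up for illustration, so treat this as a starting point rather than a reference solution.

```python
import pandas as pd

# Students with scores above average (hypothetical data)
students = pd.DataFrame({'Student': ['A', 'B', 'C'], 'Score': [55, 80, 90]})
above_avg = students[students['Score'] > students['Score'].mean()]

# Transactions in the last 7 days, relative to a fixed reference date
tx = pd.DataFrame({
    'Date': pd.to_datetime(['2024-06-01', '2024-06-10', '2024-06-14']),
    'Amount': [100, 250, 75]
})
today = pd.Timestamp('2024-06-15')  # assumed "current" date for a repeatable result
recent = tx[tx['Date'] >= today - pd.Timedelta(days=7)]

# Users who joined in 2024 and are from India
users = pd.DataFrame({
    'User': ['u1', 'u2', 'u3'],
    'Joined': pd.to_datetime(['2023-12-30', '2024-02-01', '2024-03-15']),
    'Country': ['India', 'India', 'Germany']
})
joined_2024_india = users[(users['Joined'].dt.year == 2024) & (users['Country'] == 'India')]

# Product names containing "Plus", "Pro", or "Ultra" (the pattern is a regex alternation)
products = pd.DataFrame({
    'ProductName': ['Phone Pro', 'Phone Mini', 'Tablet Ultra', 'Watch Plus']
})
premium = products[products['ProductName'].str.contains('Plus|Pro|Ultra', na=False)]

print(above_avg, recent, joined_2024_india, premium, sep='\n\n')
```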



---

## ✅ Summary Table

| Method            | Use Case            | Syntax Example                        |
| ----------------- | ------------------- | ------------------------------------- |
| Boolean condition | Simple filtering    | `df[df['Age'] > 30]`                  |
| `&`, `\|`, `~`    | Compound conditions | `(cond1) & (cond2)`                   |
| `.isin()`         | Value list match    | `df[df['City'].isin(['A', 'B'])]`     |
| `.between()`      | Range check         | `df[df['Score'].between(80, 90)]`     |
| `.str.contains()` | Substring search    | `df[df['Name'].str.contains("John")]` |
| `.query()`        | Readable queries    | `df.query('Age > 40')`                |

