# Slicing and Selecting in DataFrames in Pandas

Slicing and selecting means accessing specific rows and columns of a pandas DataFrame. The examples below use the gapminder dataset, which contains demographic data for many countries over time.

First, load the gapminder dataset into a pandas DataFrame:

In [4]:
import pandas as pd

# URL to the raw CSV file on GitHub
url = 'https://raw.githubusercontent.com/kirenz/datasets/master/gapminder.csv'

# Read the CSV file into a DataFrame
gapminder = pd.read_csv(url)

## Using `.loc[]` for Label-Based Indexing

The `.loc[]` indexer selects rows and columns by label, that is, by the values in the row index and the column names. Because the gapminder DataFrame uses the default integer index, its row labels are the integers 0, 1, 2, and so on.


```{admonition} **Syntax:**

subset = df.loc[row_labels, column_labels]
```

```{admonition} **Description:**
:class: hint
- `row_labels`: Labels of the rows to select. Can be a single label, a list/array of labels, a slice object, or a boolean array with one entry per row.
- `column_labels` (optional): Labels of the columns to select. Can be a single label, a list/array of labels, or a slice object. If omitted, all columns are returned.

```
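
The selector types listed above are easier to tell apart when the row labels are not integers. The following is a minimal sketch, assuming a small hand-built DataFrame named `demo` (a hypothetical stand-in holding a few gapminder rows for 2007) with the country names set as the index:

```python
import pandas as pd

# Hypothetical stand-in for gapminder: four 2007 rows, indexed by country name
demo = pd.DataFrame(
    {
        "country": ["Afghanistan", "Albania", "Algeria", "Angola"],
        "year": [2007, 2007, 2007, 2007],
        "lifeExp": [43.828, 76.423, 72.301, 42.731],
        "pop": [31889923, 3600523, 33333216, 12420476],
    }
).set_index("country")  # country names become the row labels

# Single label: returns a Series holding one row
one_row = demo.loc["Albania"]

# Lists of labels: returns a DataFrame with rows in the requested order
two_rows = demo.loc[["Angola", "Afghanistan"], ["lifeExp", "pop"]]

# Label slices on both axes: start and end labels are both included
sliced = demo.loc["Albania":"Angola", "year":"lifeExp"]

print(one_row)
print(two_rows)
print(sliced)
```

With a list, the result follows the order of the list rather than the order of the index; with a slice, the result follows the order of the index.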

#### Example 1: Selecting a Single Row with All Columns

Select the row labeled 0, which is the first row of the DataFrame because gapminder uses the default integer index:

In [6]:
subset = gapminder.loc[0]
print(subset)

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object


#### Example 2: Selecting Specific Rows and Columns

Select data for specific rows (indices 10, 20, 30) and columns ('country', 'year', 'pop'):

In [7]:
subset = gapminder.loc[[10, 20, 30], ['country', 'year', 'pop']]
print(subset)

        country  year       pop
10  Afghanistan  2002  25268405
20      Albania  1992   3326498
30      Algeria  1982  20033753


This prints the three requested columns for the rows labeled 10, 20, and 30.

#### Example 3: Selecting All Rows for Specific Columns

Select all rows for columns 'country', 'continent', and 'lifeExp':

In [8]:
subset = gapminder.loc[:, ['country', 'continent', 'lifeExp']]
print(subset.head())


       country continent  lifeExp
0  Afghanistan      Asia   28.801
1  Afghanistan      Asia   30.332
2  Afghanistan      Asia   31.997
3  Afghanistan      Asia   34.020
4  Afghanistan      Asia   36.088


This displays the first five rows of the gapminder DataFrame, restricted to the specified columns.

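#### Example 4: Selecting Rows by Condition with Specific Columns

Select rows for specific countries ('United States', 'China') and the columns 'country', 'year', and 'pop'. The row selector here is a boolean mask built with `.isin()` rather than a list of labels: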

In [9]:
subset = gapminder.loc[gapminder['country'].isin(['United States', 'China']), ['country', 'year', 'pop']]
print(subset)

            country  year         pop
288           China  1952   556263527
289           China  1957   637408000
290           China  1962   665770000
291           China  1967   754550000
292           China  1972   862030000
293           China  1977   943455000
294           China  1982  1000281000
295           China  1987  1084035000
296           China  1992  1164970000
297           China  1997  1230075000
298           China  2002  1280400000
299           China  2007  1318683096
1608  United States  1952   157553000
1609  United States  1957   171984000
1610  United States  1962   186538000
1611  United States  1967   198712000
1612  United States  1972   209896000
1613  United States  1977   220239000
1614  United States  1982   232187835
1615  United States  1987   242803533
1616  United States  1992   256894189
1617  United States  1997   272911760
1618  United States  2002   287675526
1619  United States  2007   301139947


#### Example 5: Selecting Specific Rows with All Columns

Select data for the country 'India':

In [10]:
india_data = gapminder.loc[gapminder['country'] == 'India']
print(india_data.head())

    country continent  year  lifeExp        pop   gdpPercap
696   India      Asia  1952   37.373  372000000  546.565749
697   India      Asia  1957   40.249  409000000  590.061996
698   India      Asia  1962   43.605  454000000  658.347151
699   India      Asia  1967   47.193  506000000  700.770611
700   India      Asia  1972   50.651  567000000  724.032527


```{Notes}

- `.loc[]` allows precise selection of rows and columns using labels.
- Label slices include both endpoints: `df.loc[10:20]` returns the rows labeled 10 through 20, including 20.
- Make sure the labels you request exist in the DataFrame index or columns; a missing label raises a `KeyError`.

```
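
The two behaviors in the notes above can be checked directly. This is a minimal sketch, assuming a hand-built DataFrame `demo` (a hypothetical stand-in containing the first five gapminder rows) with the default integer index; it also contrasts the label slice with a positional slice using `.iloc[]`, which the next section covers:

```python
import pandas as pd

# Hypothetical stand-in: the first five gapminder rows, default index 0..4
demo = pd.DataFrame(
    {
        "country": ["Afghanistan"] * 5,
        "year": [1952, 1957, 1962, 1967, 1972],
        "lifeExp": [28.801, 30.332, 31.997, 34.020, 36.088],
    }
)

by_label = demo.loc[1:3]      # labels 1, 2, and 3: end label included
by_position = demo.iloc[1:3]  # positions 1 and 2: end position excluded

print(len(by_label), len(by_position))

# A label that is not in the index raises KeyError
try:
    demo.loc[99]
    key_error_raised = False
except KeyError as err:
    key_error_raised = True
    print("KeyError:", err)
```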

<hr>

## Using `.iloc[]` for Integer-Based Indexing

The `.iloc[]` indexer selects rows and columns by integer position rather than by label. Positions start at 0, and slices exclude the end position, as in standard Python slicing.

```{admonition} **Syntax:**

subset = df.iloc[row_indices, column_indices]
```

```{admonition} **Description:**
:class: hint
- `row_indices`: Integer indices of the rows to select. This can be a single integer, a list/array of integers, or a slice object.
- `column_indices` (optional): Integer indices of the columns to select. This can also be a single integer, a list/array of integers, or a slice object.
```
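
Positions and labels coincide only while the DataFrame keeps its default index. The sketch below assumes a small hand-built DataFrame `demo` (a hypothetical stand-in for a filtered gapminder subset that kept its original row labels) to show that `.iloc[]` ignores the labels entirely:

```python
import pandas as pd

# Hypothetical stand-in: 2007 rows that kept their original gapminder labels
demo = pd.DataFrame(
    {
        "country": ["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
        "continent": ["Asia", "Europe", "Africa", "Africa", "Americas"],
        "lifeExp": [43.828, 76.423, 72.301, 42.731, 75.320],
    },
    index=[11, 23, 35, 47, 59],
)

first_by_position = demo.iloc[0]  # first row, regardless of its label (11)
last_two = demo.iloc[-2:]         # negative positions count from the end
corner = demo.iloc[:2, 0:2]       # rows 0-1 and columns 0-1; end positions excluded

print(first_by_position["country"])
print(last_two)
print(corner)

# demo.loc[0] would raise KeyError here, because no row carries the label 0
```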

#### Example 6: Selecting Specific Rows and Columns

Select the first three rows and the columns at positions 0, 1, and 4 ('country', 'continent', and 'pop'):

In [11]:
subset = gapminder.iloc[:3, [0, 1, 4]]
print(subset)

       country continent       pop
0  Afghanistan      Asia   8425333
1  Afghanistan      Asia   9240934
2  Afghanistan      Asia  10267083


#### Example 7: Selecting All Rows for Specific Columns

Select all rows for columns at positions 0, 3, and 5:

In [12]:
subset = gapminder.iloc[:, [0, 3, 5]]
print(subset.head())

       country  lifeExp   gdpPercap
0  Afghanistan   28.801  779.445314
1  Afghanistan   30.332  820.853030
2  Afghanistan   31.997  853.100710
3  Afghanistan   34.020  836.197138
4  Afghanistan   36.088  739.981106


This displays the first five rows of the gapminder DataFrame, restricted to the columns at positions 0, 3, and 5.

#### Example 8: Selecting a Single Row with All Columns

Select data for the row at position 100:

In [15]:
row_data = gapminder.iloc[100]
print(row_data)

country      Bangladesh
continent          Asia
year               1972
lifeExp          45.252
pop            70759295
gdpPercap    630.233627
Name: 100, dtype: object


## Using Boolean Indexing

Boolean indexing in pandas allows you to filter rows of a DataFrame based on a condition expressed as a boolean array. This method is versatile and powerful for selecting subsets of data that meet specific criteria.

```{admonition} **Syntax:**

To perform boolean indexing, you typically use a boolean condition that evaluates to `True` or `False` for each row in the DataFrame. The condition can involve comparisons, logical operations (`&`, `|`, `~`), or boolean masks.

```
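
The examples in this section combine conditions with `&` only. As a complement, here is a minimal sketch of the other two operators, `|` (or) and `~` (not), assuming a hand-built DataFrame `demo` (a hypothetical stand-in holding five gapminder rows for 2007):

```python
import pandas as pd

# Hypothetical stand-in: five gapminder rows for 2007
demo = pd.DataFrame(
    {
        "country": ["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
        "continent": ["Asia", "Europe", "Africa", "Africa", "Americas"],
        "lifeExp": [43.828, 76.423, 72.301, 42.731, 75.320],
    }
)

# A comparison produces a boolean Series with one True/False per row
in_africa = demo["continent"] == "Africa"

# `|` keeps rows where at least one condition is True
either = demo[in_africa | (demo["lifeExp"] > 76)]

# `~` inverts a mask: keep rows that are NOT in Africa
not_africa = demo[~in_africa]

print(either)
print(not_africa)
```

Storing a mask in a variable such as `in_africa` (a name chosen here for illustration) keeps longer filters readable and lets the same condition be reused.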

#### Example 9: Selecting Data for the Year 2007


In [16]:
subset_bool = gapminder[gapminder['year'] == 2007]
print(subset_bool.head())

        country continent  year  lifeExp       pop     gdpPercap
11  Afghanistan      Asia  2007   43.828  31889923    974.580338
23      Albania    Europe  2007   76.423   3600523   5937.029526
35      Algeria    Africa  2007   72.301  33333216   6223.367465
47       Angola    Africa  2007   42.731  12420476   4797.231267
59    Argentina  Americas  2007   75.320  40301927  12779.379640


#### Example 10: Selecting Data with Two Conditions

Select data for the countries 'United States' and 'China' in the year 2007:

In [17]:
subset_bool = gapminder[(gapminder['country'].isin(['United States', 'China'])) & (gapminder['year'] == 2007)]
print(subset_bool.head())

            country continent  year  lifeExp         pop     gdpPercap
299           China      Asia  2007   72.961  1318683096   4959.114854
1619  United States  Americas  2007   78.242   301139947  42951.653090


* `(gapminder['country'].isin(['United States', 'China']))` creates a boolean mask selecting rows where the country is either 'United States' or 'China'.
* `(gapminder['year'] == 2007)` creates another boolean mask selecting rows where the year is 2007.
* The `&` operator combines both boolean masks, ensuring both conditions are `True` for a row to be included in `subset_bool`.

#### Example 11: Selecting Data with Three Conditions

Select rows in Asia with a population greater than 1 billion, before the year 2000:


In [18]:
subset_bool = gapminder[(gapminder['pop'] > 1e9) & (gapminder['continent'] == 'Asia') & (gapminder['year'] < 2000)]
print(subset_bool.head())

    country continent  year  lifeExp         pop    gdpPercap
294   China      Asia  1982   65.525  1000281000   962.421381
295   China      Asia  1987   67.274  1084035000  1378.904018
296   China      Asia  1992   68.690  1164970000  1655.784158
297   China      Asia  1997   70.426  1230075000  2289.234136


* `(gapminder['pop'] > 1e9)` selects rows where the population (pop) is greater than 1 billion.
* `(gapminder['continent'] == 'Asia')` selects rows where the continent is 'Asia'.
* `(gapminder['year'] < 2000)` selects rows where the year is before 2000.
* The `&` operator combines all three conditions, ensuring all conditions must be `True` for a row to be included in `subset_bool`.

```{Notes}

* Use parentheses `()` to group conditions properly, especially when combining multiple conditions.
* Boolean indexing (`[]`) allows flexible and powerful filtering based on complex criteria in pandas DataFrames.
* Ensure conditions are logically structured to accurately capture the subset of data needed for analysis.

```
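
The first note matters because `&` and `|` bind more tightly than comparison operators such as `==` and `>` in Python. The following sketch, assuming a small hand-built DataFrame `demo` (a hypothetical stand-in with three of the China rows), shows what happens when the parentheses are left out:

```python
import pandas as pd

# Hypothetical stand-in: three China rows from gapminder
demo = pd.DataFrame(
    {
        "year": [1997, 2002, 2007],
        "pop": [1230075000, 1280400000, 1318683096],
    }
)

# Without parentheses Python evaluates `2007 & demo["pop"]` first, which turns
# the expression into a chained comparison on Series and raises ValueError
try:
    demo[demo["year"] == 2007 & demo["pop"] > 1e9]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# With each condition in parentheses, the filter behaves as intended
correct = demo[(demo["year"] == 2007) & (demo["pop"] > 1e9)]
print(correct)
```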

## Using `.query()` for Conditional Selection

`.query()` is a method in pandas that allows you to filter rows of a DataFrame using a boolean expression. It provides a convenient way to write complex queries without explicitly referencing the DataFrame each time.


```{admonition} **Syntax:**

To use `.query()`, you specify a string containing the boolean expression as the argument. This expression can reference column names directly.
```
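
Besides column names, a query string can refer to Python variables from the surrounding scope by prefixing them with `@`. This is a minimal sketch, assuming a hand-built DataFrame `demo` (a hypothetical stand-in with four gapminder rows) and two illustrative variables, `target_year` and `min_life_exp`:

```python
import pandas as pd

# Hypothetical stand-in: four gapminder rows
demo = pd.DataFrame(
    {
        "country": ["Afghanistan", "China", "China", "United States"],
        "year": [2007, 1997, 2007, 2007],
        "lifeExp": [43.828, 70.426, 72.961, 78.242],
    }
)

# Illustrative thresholds defined outside the query string
target_year = 2007
min_life_exp = 70

# `@name` inserts the value of a local Python variable into the expression
subset = demo.query("year == @target_year and lifeExp > @min_life_exp")
print(subset)
```

This avoids building the query string by hand with string formatting when the thresholds change between runs.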

#### Example 12: Basic Usage

Select data for the year 2007 using `.query()`:

In [19]:
subset_query = gapminder.query("year == 2007")
print(subset_query.head())

        country continent  year  lifeExp       pop     gdpPercap
11  Afghanistan      Asia  2007   43.828  31889923    974.580338
23      Albania    Europe  2007   76.423   3600523   5937.029526
35      Algeria    Africa  2007   72.301  33333216   6223.367465
47       Angola    Africa  2007   42.731  12420476   4797.231267
59    Argentina  Americas  2007   75.320  40301927  12779.379640


In this example:

* `year == 2007` is the boolean expression passed to `.query()`, selecting rows where the year column equals 2007.

#### Example 13: Using Logical Operators

Select rows from 2007 with life expectancy greater than 80 and GDP per capita greater than 35000:

In [32]:
subset_query = gapminder.query("lifeExp > 80 & gdpPercap > 35000 & year == 2007")
print(subset_query.head())

               country continent  year  lifeExp       pop    gdpPercap
251             Canada  Americas  2007   80.653  33390141  36319.23501
671   Hong Kong, China      Asia  2007   82.208   6980412  39724.97867
695            Iceland    Europe  2007   81.757    301931  36180.78919
1151            Norway    Europe  2007   80.196   4627926  49357.19017
1487       Switzerland    Europe  2007   81.701   7554661  37506.41907


In this example:

* `lifeExp > 80 & gdpPercap > 35000 & year == 2007` combines three conditions with the `&` operator, selecting rows where life expectancy (`lifeExp`) is greater than 80, GDP per capita (`gdpPercap`) is greater than 35000, and `year` is 2007. Inside a query string, no parentheses are needed around the individual comparisons.

#### Example 14: Using String Methods

Select data for countries whose names start with 'A' in the year 2007:

In [33]:
subset_query = gapminder.query("country.str.startswith('A') & year == 2007")
print(subset_query.head())

        country continent  year  lifeExp       pop     gdpPercap
11  Afghanistan      Asia  2007   43.828  31889923    974.580338
23      Albania    Europe  2007   76.423   3600523   5937.029526
35      Algeria    Africa  2007   72.301  33333216   6223.367465
47       Angola    Africa  2007   42.731  12420476   4797.231267
59    Argentina  Americas  2007   75.320  40301927  12779.379640


In this example:

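* `country.str.startswith('A') & year == 2007` uses the string method `str.startswith()` to filter rows where the country name starts with 'A' and the year is 2007. If this expression raises an error in your environment, passing `engine='python'` to `.query()` is a common workaround.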

```{Notes}

* `.query()` is useful for writing concise and readable code for conditional selection.
* Ensure the boolean expression is correctly formatted within the query string.
* Column names used in the query can be referenced directly without using the DataFrame name.

```
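
A query string and a boolean mask expressing the same conditions select the same rows. The sketch below checks this on a hand-built DataFrame `demo` (a hypothetical stand-in with four gapminder rows); it also uses the keyword `and` in place of `&`, and passes `engine="python"` explicitly for the string-method query as a conservative choice:

```python
import pandas as pd

# Hypothetical stand-in: four gapminder rows
demo = pd.DataFrame(
    {
        "country": ["Afghanistan", "Albania", "China", "China"],
        "year": [2007, 2007, 1997, 2007],
        "lifeExp": [43.828, 76.423, 70.426, 72.961],
    }
)

# The same filter written as a boolean mask and as a query string
mask_result = demo[(demo["year"] == 2007) & (demo["lifeExp"] > 70)]
query_result = demo.query("year == 2007 and lifeExp > 70")
same_rows = mask_result.equals(query_result)
print(same_rows)

# String methods inside a query, evaluated with the Python engine
starts_with_a = demo.query(
    "country.str.startswith('A') and year == 2007", engine="python"
)
print(starts_with_a)
```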