# Slicing and Indexing DataFrames
## Explicit Indexes
### Setting and removing indexes
- pandas lets you set columns as an index in a DataFrame.
- Setting an index enables:
  - Cleaner code when subsetting data.
  - More efficient lookups in some cases.
- You can also remove an index and reset the DataFrame back to its default integer indexing.

In [1]:
import pandas as pd

# Load the books dataset
books = pd.read_csv(r'C:\Users\Dell\OneDrive\Desktop\KaranCodes\Datacampcourses\Associate-Data-Scientist-Python-Track\resources\DatamanipulationwithPandas-datasets\books.csv')

# Look at books
print(books)

# Set the index of books to title
books_ind = books.set_index("title")

# Look at books_ind
print(books_ind)

# Reset the books_ind index, keeping its contents
print(books_ind.reset_index())

# Reset the books_ind index, dropping its contents
print(books_ind.reset_index(drop=True))

    book_id                                 title                  author  \
0         1                         The Alchemist            Paulo Coelho   
1         2                         Atomic Habits             James Clear   
2         3                               Sapiens       Yuval Noah Harari   
3         4                             Deep Work             Cal Newport   
4         5               Harry Potter and the PS            J.K. Rowling   
5         6                        The Subtle Art             Mark Manson   
6         7                   Think and Grow Rich           Napoleon Hill   
7         8                              Educated           Tara Westover   
8         9                              Becoming          Michelle Obama   
9        10                    The Power of Habit          Charles Duhigg   
10       11                            The Hobbit          J.R.R. Tolkien   
11       12                                  1984           George Orwell   
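
The cell above depends on a local CSV file, so here is a minimal self-contained sketch of the same three operations. The three-row `books_demo` frame is a hypothetical stand-in for the books dataset, not the real file.

```python
import pandas as pd

# Hypothetical three-row stand-in for the books dataset
books_demo = pd.DataFrame({
    "title": ["The Alchemist", "Atomic Habits", "Sapiens"],
    "author": ["Paulo Coelho", "James Clear", "Yuval Noah Harari"],
    "rating": [3.6, 4.6, 4.0],
})

# Move "title" out of the columns and into the index
demo_ind = books_demo.set_index("title")

# reset_index() turns the index back into an ordinary column
restored = demo_ind.reset_index()

# reset_index(drop=True) discards the index values entirely
dropped = demo_ind.reset_index(drop=True)

print(list(demo_ind.columns))   # "title" is no longer a column
print(list(restored.columns))   # "title" is a column again
print(list(dropped.columns))    # "title" is gone for good
```

Note that `set_index()` and `reset_index()` return new DataFrames; `books_demo` itself is left unchanged.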

### Subsetting with .loc[]
- The main payoff of setting an index is .loc[], an indexer that subsets rows based on index values.
- When passed a single argument, .loc[] selects a subset of rows.
- Advantages of using .loc[]:
  - Easier to read than standard square bracket [] subsetting.
  - Simplifies code maintenance by making your code clearer and cleaner.

In [2]:
# Set the index to "title"
books_ind = books.set_index("title")

# Make a list of titles to subset on
selected_titles = ["The Alchemist", "Atomic Habits", "Sapiens"]

# Subset books using square brackets
print(books[books["title"].isin(selected_titles)])

# Subset books_ind using .loc[]
print(books_ind.loc[selected_titles])

   book_id          title             author    genre  pages  published_year  \
0        1  The Alchemist       Paulo Coelho  Fantasy    263            2007   
1        2  Atomic Habits        James Clear  Romance    336            2014   
2        3        Sapiens  Yuval Noah Harari  History    337            1989   

   rating  
0     3.6  
1     4.6  
2     4.0  
               book_id             author    genre  pages  published_year  \
title                                                                       
The Alchemist        1       Paulo Coelho  Fantasy    263            2007   
Atomic Habits        2        James Clear  Romance    336            2014   
Sapiens              3  Yuval Noah Harari  History    337            1989   

               rating  
title                  
The Alchemist     3.6  
Atomic Habits     4.6  
Sapiens           4.0  
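
The two approaches are not fully interchangeable. The sketch below, using a hypothetical three-row `demo` frame, illustrates two differences worth knowing: `.loc[]` returns rows in the order of the list you pass, and it raises a `KeyError` if a label is missing from the index, whereas the `.isin()` mask keeps the original row order and silently ignores unknown values.

```python
import pandas as pd

# Hypothetical stand-in for the books dataset
demo = pd.DataFrame({
    "title": ["The Alchemist", "Atomic Habits", "Sapiens"],
    "rating": [3.6, 4.6, 4.0],
})
demo_ind = demo.set_index("title")

wanted = ["Sapiens", "The Alchemist"]

# Boolean mask: rows come back in the DataFrame's original order
by_mask = demo[demo["title"].isin(wanted)]

# .loc[]: rows come back in the order of the list
by_loc = demo_ind.loc[wanted]

print(list(by_mask["title"]))
print(list(by_loc.index))

# A label that is not in the index makes .loc[] raise KeyError
missing_label_error = False
try:
    demo_ind.loc[["Sapiens", "Not A Real Title"]]
except KeyError:
    missing_label_error = True
print(missing_label_error)
```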


### Setting multi-level indexes
- Indexes can be created from multiple columns, forming a multi-level index (also called a hierarchical index).
- Benefits of multi-level indexes:
  - Make it easier to reason about nested categorical variables.
- Examples:
  - In clinical trials: test subjects nested inside treatment groups.
  - In temperature datasets: cities nested inside countries.
- Downsides:
  - Manipulating indexes requires different syntax compared to manipulating regular columns.
  - You must learn and manage two syntaxes and stay aware of how your data is structured.

In [3]:
# Set the index to ["genre", "author"]
books_ind = books.set_index(["genre", "author"])

# List of tuples: ("Fantasy", "Paulo Coelho") and ("History", "Yuval Noah Harari")
rows_to_keep = [("Fantasy", "Paulo Coelho"), ("History", "Yuval Noah Harari")]

# Subset for rows to keep
print(books_ind.loc[rows_to_keep])

                           book_id          title  pages  published_year  \
genre   author                                                             
Fantasy Paulo Coelho             1  The Alchemist    263            2007   
History Yuval Noah Harari        3        Sapiens    337            1989   

                           rating  
genre   author                     
Fantasy Paulo Coelho          3.6  
History Yuval Noah Harari     4.0  
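
As a rough illustration of the two syntaxes on a multi-level index, the sketch below uses a hypothetical three-row `demo` frame. A list of plain strings is matched against the outer level, while a list of tuples picks specific outer/inner pairs.

```python
import pandas as pd

# Hypothetical stand-in for the books dataset
demo = pd.DataFrame({
    "genre": ["Fantasy", "Fantasy", "History"],
    "author": ["Paulo Coelho", "J.R.R. Tolkien", "Yuval Noah Harari"],
    "title": ["The Alchemist", "The Hobbit", "Sapiens"],
})
demo_ind = demo.set_index(["genre", "author"]).sort_index()

# A list of outer-level labels selects every row in those genres
fantasy_rows = demo_ind.loc[["Fantasy"]]

# A list of (genre, author) tuples selects specific pairs
pair_rows = demo_ind.loc[
    [("Fantasy", "Paulo Coelho"), ("History", "Yuval Noah Harari")]
]

print(fantasy_rows)
print(pair_rows)
```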


### Sorting by index values
- You can change the row order in a DataFrame using .sort_values().
- To sort by elements in the index, you should use .sort_index().
- Sorting by the index can make your data easier to browse and more organized, especially when using multi-level indexes.

In [4]:
# Set the index to ["genre", "author"]
books_ind = books.set_index(["genre", "author"])

# Sort books_ind by index values
print(books_ind.sort_index())

# Sort books_ind by index values at the author level
print(books_ind.sort_index(level=["author"]))

# Sort books_ind by genre ascending, then author descending
print(books_ind.sort_index(level=["genre", "author"], ascending=[True, False]))

                                     book_id  \
genre        author                            
Fantasy      Ali Hazelwood                43   
             Colleen Hoover               33   
             Colleen Hoover               34   
             Erin Morgenstern             29   
             Gabrielle Zevin              42   
             Gail Honeyman                28   
             George Orwell                12   
             Glennon Doyle                32   
             J.R.R. Tolkien               11   
             Michelle Obama                9   
             Paulo Coelho                  1   
             Tara Westover                31   
Fiction      Aldous Huxley                21   
             Charles Duhigg               10   
             Colleen Hoover               46   
             Delia Owens                  20   
             F. Scott Fitzgerald          14   
             Janet Skeslien Charles       39   
             Markus Zusak               
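
The full output is long, so a smaller sketch may make the three sort orders easier to compare. The `demo` frame below is a hypothetical three-row stand-in for the books dataset.

```python
import pandas as pd

# Hypothetical stand-in for the books dataset
demo = pd.DataFrame({
    "genre": ["History", "Fantasy", "Fantasy"],
    "author": ["Cal Newport", "Paulo Coelho", "J.R.R. Tolkien"],
    "book_id": [4, 1, 11],
})
demo_ind = demo.set_index(["genre", "author"])

# Default: sort by every level, outer to inner, ascending
default_sorted = demo_ind.sort_index()

# Sort by the inner level (author) first
by_author = demo_ind.sort_index(level="author")

# Genre ascending, then author descending within each genre
mixed = demo_ind.sort_index(level=["genre", "author"], ascending=[True, False])

print(list(default_sorted["book_id"]))  # [11, 1, 4]
print(list(by_author["book_id"]))       # [4, 11, 1]
print(list(mixed["book_id"]))           # [1, 11, 4]
```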

## Slicing and subsetting with .loc and .iloc
### Slicing index values
- Slicing allows you to select consecutive elements using the first:last syntax.
- In DataFrames, you can slice by index values (not just row/column numbers).
- Slicing by index is done inside .loc[].
- Key points to remember:
  - Sort the index with .sort_index() before slicing; slicing an unsorted index can raise an error or return unexpected rows.
  - To slice at the outer level of a multi-level index, first and last can be strings.
  - To slice at inner levels, first and last must be tuples.
  - Passing a single slice to .loc[] slices the rows.

In [5]:
# Set the index to ["genre", "author"]
books_ind = books.set_index(["genre", "author"])

# Sort the index of books_ind
books_srt = books_ind.sort_index()

# Subset rows from Fantasy to Romance (by genre)
print(books_srt.loc["Fantasy":"Romance"])

# Slicing by author alone does not work as intended: plain strings are compared against the outer level (genre)
# print(books_srt.loc["Paulo Coelho":"Yuval Noah Harari"])  # Rows are chosen by genre, so the result is not meaningful

# Subset rows from ("Fantasy", "Paulo Coelho") to ("History", "Yuval Noah Harari")
print(books_srt.loc[("Fantasy", "Paulo Coelho"):("History", "Yuval Noah Harari")])

                                     book_id  \
genre        author                            
Fantasy      Ali Hazelwood                43   
             Colleen Hoover               33   
             Colleen Hoover               34   
             Erin Morgenstern             29   
             Gabrielle Zevin              42   
             Gail Honeyman                28   
             George Orwell                12   
             Glennon Doyle                32   
             J.R.R. Tolkien               11   
             Michelle Obama                9   
             Paulo Coelho                  1   
             Tara Westover                31   
Fiction      Aldous Huxley                21   
             Charles Duhigg               10   
             Colleen Hoover               46   
             Delia Owens                  20   
             F. Scott Fitzgerald          14   
             Janet Skeslien Charles       39   
             Markus Zusak               
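
One detail the output above does not make obvious: unlike ordinary Python slices, label slices in `.loc[]` include the end label. The sketch below shows this on a hypothetical five-row `demo` frame.

```python
import pandas as pd

# Hypothetical stand-in for the books dataset
demo = pd.DataFrame({
    "genre": ["Fantasy", "Fantasy", "Fiction", "History", "Romance"],
    "author": ["J.R.R. Tolkien", "Paulo Coelho", "Delia Owens",
               "Yuval Noah Harari", "James Clear"],
    "book_id": [11, 1, 20, 3, 2],
})
demo_srt = demo.set_index(["genre", "author"]).sort_index()

# Outer-level slice: "History" rows are included, "Romance" rows are not
outer = demo_srt.loc["Fantasy":"History"]

# Inner-level slice: both endpoints are (genre, author) tuples
inner = demo_srt.loc[("Fantasy", "Paulo Coelho"):("History", "Yuval Noah Harari")]

print(list(outer["book_id"]))  # [11, 1, 20, 3]
print(list(inner["book_id"]))  # [1, 20, 3]
```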

### Slicing in both directions
- DataFrames are two-dimensional, so you can slice both rows and columns simultaneously.
- This is done by passing two arguments to .loc[]:
  - The first argument slices the rows.
  - The second argument slices the columns.
- Slicing both directions at once makes subsetting more powerful and flexible.

In [6]:
# Set the index to ["genre", "author"]
books_ind = books.set_index(["genre", "author"])

# Sort the index
books_srt = books_ind.sort_index()

# Subset rows from Fantasy, J.K. Rowling to History, Yuval Noah Harari
print(books_srt.loc[("Fantasy", "J.K. Rowling"):("History", "Yuval Noah Harari")])

# Subset columns from title to rating
print(books_srt.loc[:, "title":"rating"])

# Subset in both directions at once
print(books_srt.loc[("Fantasy", "J.K. Rowling"):("History", "Yuval Noah Harari"), "title":"rating"])

                                book_id                    title  pages  \
genre   author                                                            
Fantasy J.R.R. Tolkien               11               The Hobbit    182   
        Michelle Obama                9                 Becoming    490   
        Paulo Coelho                  1            The Alchemist    263   
        Tara Westover                31                 Educated    470   
Fiction Aldous Huxley                21          Brave New World    287   
        Charles Duhigg               10       The Power of Habit    404   
        Colleen Hoover               46                Ugly Love    208   
        Delia Owens                  20  Where the Crawdads Sing    346   
        F. Scott Fitzgerald          14         The Great Gatsby    318   
        Janet Skeslien Charles       39        The Paris Library    331   
        Markus Zusak                 18           The Book Thief    284   
        S.E. Hinton      
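
A compact sketch of the same idea, assuming a hypothetical three-row `demo` frame: column slices follow the column order of the DataFrame and, like row slices in `.loc[]`, include the end label.

```python
import pandas as pd

# Hypothetical stand-in for the books dataset
demo = pd.DataFrame({
    "genre": ["Fantasy", "Fiction", "History"],
    "author": ["Paulo Coelho", "Delia Owens", "Yuval Noah Harari"],
    "book_id": [1, 20, 3],
    "title": ["The Alchemist", "Where the Crawdads Sing", "Sapiens"],
    "pages": [263, 346, 337],
})
demo_srt = demo.set_index(["genre", "author"]).sort_index()

# Rows from Fantasy to Fiction, columns from title to pages
both = demo_srt.loc["Fantasy":"Fiction", "title":"pages"]

print(both)
```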

### Slicing time series
- Slicing is especially useful for time series data to filter by date ranges.
- To slice time series:
  - Set the date column as the index.
  - Use .loc[] to perform subsetting based on dates.
- Important: Dates should be in ISO 8601 format:
  - "yyyy-mm-dd" for year-month-day.
  - "yyyy-mm" for year-month.
  - "yyyy" for year.

In [5]:
# # Import pandas
# import pandas as pd

# # Load the books dataset
# books = pd.read_csv(r"C:\Users\Dell\OneDrive\Desktop\KaranCodes\Datacampcourses\Associate-Data-Scientist-Python-Track\resources\DatamanipulationwithPandas-datasets\books.csv")

# # Ensure 'published_date' is parsed as datetime
# books["published_date"] = pd.to_datetime(books["published_date"])

# # Use Boolean conditions to subset books published in 2015 and 2016
# books_15_16 = books[
#     (books["published_date"] >= "2015-01-01") & (books["published_date"] <= "2016-12-31")
# ]
# print(books_15_16)

# # Set published_date as the index and sort it
# books_ind = books.set_index("published_date").sort_index()

# # Use .loc[] to subset books published in 2015 and 2016
# print(books_ind.loc["2015":"2016"])

# # Use .loc[] to subset books from August 2015 to February 2016
# print(books_ind.loc["2015-08":"2016-02"])
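
Since the cell above is commented out, here is a small runnable sketch of the same technique. The data is entirely hypothetical: the `published_date` values and the titles are invented for illustration and do not come from the books dataset.

```python
import pandas as pd

# Hypothetical data: titles and dates are invented for illustration
demo = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "published_date": ["2015-03-10", "2015-09-01", "2016-01-15", "2017-06-30"],
})

# Parse the strings as datetimes, then use them as a sorted index
demo["published_date"] = pd.to_datetime(demo["published_date"])
demo_ind = demo.set_index("published_date").sort_index()

# Partial date strings: everything in 2015 and 2016
years = demo_ind.loc["2015":"2016"]

# Partial date strings: August 2015 through February 2016
months = demo_ind.loc["2015-08":"2016-02"]

print(list(years["title"]))   # ['Book A', 'Book B', 'Book C']
print(list(months["title"]))  # ['Book B', 'Book C']
```

A partial string such as `"2016"` as the end of a slice covers the whole of that year, so the slice runs through 31 December 2016.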

### Subsetting by row/column number
- While Boolean conditions and index labels are the most common ways to subset rows, you can also subset by row/column numbers.
- This is done using .iloc[].
- Like .loc[], .iloc[] can take two arguments:
  - The first for row numbers.
  - The second for column numbers.
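
Two conventions are easy to trip over here: positions in `.iloc[]` start at 0, and slices exclude the end position (unlike label slices in `.loc[]`). A minimal sketch on a hypothetical four-row `demo` frame:

```python
import pandas as pd

# Hypothetical stand-in for the books dataset
demo = pd.DataFrame({
    "book_id": [1, 2, 3, 4],
    "title": ["The Alchemist", "Atomic Habits", "Sapiens", "Deep Work"],
    "author": ["Paulo Coelho", "James Clear", "Yuval Noah Harari", "Cal Newport"],
    "genre": ["Fantasy", "Romance", "History", "History"],
})

# Row position 2 (third row), column position 1 (second column)
single_value = demo.iloc[2, 1]

# Positions 0 and 1 only: the end position 2 is excluded
first_two = demo.iloc[:2]

# Column positions 2 and 3: "author" and "genre"
middle_cols = demo.iloc[:, 2:4]

print(single_value)  # Sapiens
print(list(middle_cols.columns))
```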

In [6]:
# Get the value at row position 22, column position 1 (the 23rd row, 2nd column)
print(books.iloc[22, 1])

# Use slicing to get the first 5 rows
print(books.iloc[:5])

# Use slicing to get the columns at positions 2 and 3 (the third and fourth columns: "author" and "genre")
print(books.iloc[:, 2:4])

# Use slicing in both directions at once
print(books.iloc[:5, 2:4])

The Midnight Library
   book_id                    title             author    genre  pages  \
0        1            The Alchemist       Paulo Coelho  Fantasy    263   
1        2            Atomic Habits        James Clear  Romance    336   
2        3                  Sapiens  Yuval Noah Harari  History    337   
3        4                Deep Work        Cal Newport  History    323   
4        5  Harry Potter and the PS       J.K. Rowling  Romance    361   

   published_year  rating  
0            2007     3.6  
1            2014     4.6  
2            1989     4.0  
3            2009     3.8  
4            1957     3.9  
                    author         genre
0             Paulo Coelho       Fantasy
1              James Clear       Romance
2        Yuval Noah Harari       History
3              Cal Newport       History
4             J.K. Rowling       Romance
5              Mark Manson     Self-Help
6            Napoleon Hill       Romance
7            Tara Westover       Ficti