# Selecting Subsets of Data from a Series

Selecting subsets of data from a Series is accomplished similarly to how it's done with DataFrames. 

## Series indexer rules

The same three indexers, `[]`, `loc`, and `iloc`, are available for the Series. Because there are no columns in a Series, the rules for each indexer are slightly different than they are for a DataFrame. Let's begin by reading in the movie dataset and setting the index to the title.

In [1]:
import pandas as pd
movie = pd.read_csv('input/movie.csv', index_col='title')
movie.head(3)

Let's select a single column of data so that we can have access to a Series. Here, we select the `imdb_score` column.

In [None]:
imdb = movie['imdb_score']
imdb.head(3)

### Series subset selection with just the brackets

For DataFrames, we learned that *just the brackets* accepted either a single label or a list of labels and used this input to select one or more DataFrame columns. For a Series, *just the brackets* has different rules that you must follow to use it correctly. It allows selection by index label. For instance, we can select the `imdb_score` for the movie Avatar like this:

In [None]:
imdb['Avatar']

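Interestingly enough, it's possible to use integer location as well with *just the brackets*. The movie Avatar is at integer location 0 and we can duplicate our previous result by using it. Be aware that newer versions of pandas deprecate this positional fallback when the index does not hold integers, so the cell below may emit a warning or fail depending on your version.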

In [None]:
imdb[0]

## Use `loc` and `iloc` instead of just the brackets

For a Series, *just the brackets* is flexible and can take either a label or integer location. This might make it seem like `loc` and `iloc` would be unnecessary, but the opposite is actually the case. Using *just the brackets* for a Series is ambiguous and not explicit. It's not clear whether the label or the integer location is being used.

I suggest only using `loc` and `iloc` for clarity. Whenever the `loc` indexer is used, we are certain it selects by label. Likewise, whenever the `iloc` indexer is used, we are certain it selects by integer location.
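
The ambiguity is easiest to see when the index labels are themselves integers. The following is a minimal sketch using a made-up Series named `s` (it is not part of the movie dataset) whose integer labels are deliberately out of order, so that label 0 and integer location 0 refer to different values.

```python
import pandas as pd

# Hypothetical Series: the labels are integers in a shuffled order
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the value whose label is 0 -> 20
by_position = s.iloc[0]  # the value at the first position -> 10

print(by_label, by_position)
```

With `loc` and `iloc`, the reader never has to guess which of the two values is intended.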

## Series subset selection with `loc`

The `loc` indexer selects by **label** just as it does with DataFrames. Since there are no columns, it only accepts a single selection object which can be any of the following:

* A single label
* A list of labels
* A slice with labels
* A boolean Series (covered in a later chapter)

### Select a single value with `loc`

Select a single value by providing the `loc` indexer the name of the index. Here, we select the `imdb_score` of the movie Forrest Gump. When selecting a single value, just that value is returned and not a Series.

In [None]:
imdb.loc['Forrest Gump']

### Select multiple values using a list with `loc`

Provide the `loc` indexer a list of index labels to select multiple values. This will return a Series.

In [None]:
names = ['Good Will Hunting', 'Home Alone', 'Meet the Parents']
imdb.loc[names]

### Select multiple values using slice notation with `loc`

Provide the `loc` indexer index labels for the start and stop components of slice notation to select all of the values between those two labels. The results are **inclusive** of the stop label.

In [None]:
imdb.loc['Home Alone':'Top Gun']

As with any slice notation, all components are optional. Here, we select every `imdb_score` from the movie Twins to the end.

In [None]:
imdb.loc['Twins':].head()

In this example, we select every 300th `imdb_score` beginning at the movie Twins to the end.

In [None]:
imdb.loc['Twins'::300]
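
If the movie file is not available, the same `loc` selections can be tried on a small stand-in Series. This is only a sketch: the `scores` Series, its titles, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical scores; the titles and values are made up
scores = pd.Series(
    [7.9, 6.1, 8.3, 7.0, 5.5],
    index=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'],
)

single = scores.loc['Charlie']            # a single label returns the value itself
several = scores.loc[['Echo', 'Alpha']]   # a list returns a Series in the requested order
inclusive = scores.loc['Bravo':'Delta']   # a label slice includes the stop label 'Delta'
stepped = scores.loc['Alpha'::2]          # every 2nd value from 'Alpha' to the end

print(single)
print(inclusive)
```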

## Series subset selection with `iloc`

The Series `iloc` indexer is analogous to `loc` except that it only makes selections by integer location. It accepts the following kinds of selections:

* A single integer location
* A list of integer locations
* A slice with integer locations

### Select a single value with `iloc`

Let's select the `imdb_score` for the movie with integer location 499.

In [None]:
imdb.iloc[499]

Selecting with a single integer always returns the value by itself and not within a Series. If we want to return a one-item Series, so that we can see the index, we can use a one-item list as our selection.

In [None]:
imdb.iloc[[499]]

### Select multiple values using a list with `iloc`

Provide `iloc` a list of integer locations to select multiple values.

In [None]:
ints = [499, 599, 699]
imdb.iloc[ints]

### Select multiple values using slice notation with `iloc`

Provide `iloc` with slice notation using integers as the start and stop components to select all the values between those two locations. The results are **exclusive** of the stop integer. Here, we select integer locations 145 up to, but not including, 148.

In [None]:
imdb.iloc[145:148]

Let's select the last three values using slice notation.

In [None]:
imdb.iloc[-3:]

Let's select every 200th value from integer location 1,000 up to, but not including, 2,000.

In [None]:
imdb.iloc[1000:2000:200]
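
The `iloc` rules above can be checked on a small stand-in Series. This is a sketch under the assumption of invented data: `scores` and its labels are hypothetical and much shorter than the movie dataset.

```python
import pandas as pd

# Hypothetical scores; the titles and values are made up
scores = pd.Series(
    [7.9, 6.1, 8.3, 7.0, 5.5],
    index=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'],
)

value = scores.iloc[2]        # a single integer returns the value itself
one_item = scores.iloc[[2]]   # a one-item list returns a Series, so the label is visible
exclusive = scores.iloc[1:3]  # positions 1 and 2 only; the stop position 3 is excluded
last_two = scores.iloc[-2:]   # negative integers count from the end

print(one_item)
print(exclusive)
```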

## Summary of Series subset selection

The three indexers, `[]`, `loc`, and `iloc`, are available to make subset selections on a Series. They work similarly to how they do on DataFrames.

* The `loc` indexer makes selections by label using a:
    * single label
    * list of labels
    * slice of labels
    * boolean Series
* The `iloc` indexer makes selections by integer location using a:
    * single integer location
    * list of integer locations
    * slice of integer locations
* Use `loc` and `iloc` instead of *just the brackets* to be explicit
* There are no columns in a Series, so selection is only based on the index

## Exercises

Execute the cell below to select the `duration` column (length of movie in minutes) as a Series and use it for the first few exercises.

In [None]:
duration = movie['duration']
duration.head()

### Read in bikes dataset

Execute the cell below to read in the bikes dataset and select the `wind_speed` column, then use it for the rest of the exercises. Notice that the index labels are integers, meaning that when you use `loc` you will be using integers.

In [None]:
bikes = pd.read_csv('input/bikes.csv')
wind = bikes['wind_speed']
wind.head()
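
When the labels are integers, `loc` and `iloc` slices look alike but behave differently at the stop value. The sketch below uses a hypothetical `wind_demo` Series with the default integer index rather than the real bikes data.

```python
import pandas as pd

# Hypothetical wind speeds with the default integer labels 0 through 4
wind_demo = pd.Series([5.2, 3.1, 8.4, 6.0, 2.7])

by_label = wind_demo.loc[1:3]      # labels 1, 2, and 3 (stop label included)
by_position = wind_demo.iloc[1:3]  # positions 1 and 2 (stop position excluded)

print(len(by_label), len(by_position))
```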

# Setting a Meaningful Index

The index of a DataFrame provides a label for each of the rows. If not explicitly provided, pandas uses a sequence of consecutive integers beginning at 0 as the index. In this chapter, we learn how to set one of the columns of the DataFrame as the new index so that it provides a more meaningful label for each row.

## Setting an index of a DataFrame

Instead of using the default index for your pandas DataFrame, you can call the `set_index` method to use one of the columns as the index. Let's read in a small dataset to show how this is done. Note the current index is just consecutive integers beginning from 0.

In [None]:
import pandas as pd
df = pd.read_csv('input/sample_data.csv')
df

### The `set_index` method

Pass the `set_index` method the name of the column to use as the index. This column will no longer be part of the data of the returned DataFrame, and the original index will no longer be there.

In [None]:
df.set_index('name')

### A new DataFrame copy is returned

The `set_index` method returns an entirely new DataFrame by default and does not modify the original calling DataFrame. Let's verify this by outputting the DataFrame referenced by `df`. It has not changed.

In [None]:
df

### Assigning the result of `set_index` to a variable name

We must assign the result of the `set_index` method to a variable name if we are to use this new DataFrame with its new index.

In [None]:
df2 = df.set_index('name')
df2

### Number of columns decreased

The new DataFrame, `df2`, has one fewer column than the original because the `name` column was set as the index. Let's verify this by accessing the `shape` attribute of the original and new DataFrames.

In [None]:
df.shape

In [None]:
df2.shape
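
The same behavior can be reproduced without the sample file. The sketch below builds a hypothetical three-row DataFrame, `df_demo`, whose column names mirror the ones used in this chapter while the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the sample data; the values are made up
df_demo = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cai'],
    'color': ['red', 'blue', 'red'],
    'age': [31, 45, 28],
})

# set_index returns a new DataFrame; df_demo itself is left untouched
df_demo2 = df_demo.set_index('name')

print(df_demo.shape)   # the original still has the name column
print(df_demo2.shape)  # one fewer column, because name became the index
```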

## Accessing the index, columns, and data

The index, columns, and data are each separate objects that can be accessed from the DataFrame as attributes and NOT methods. Let's assign each of them to their own variable name beginning with the index and output it to the screen.

In [None]:
index = df2.index
index

In [None]:
columns = df2.columns
columns

In [None]:
data = df2.values
data

### Find the type of these objects

The output of these objects looks correct, but we don't know the exact type of each one. Let's find out the types of each object.

In [None]:
type(index)

In [None]:
type(columns)

In [None]:
type(data)

### Accessing the components does not change the DataFrame

Accessing these components does nothing to our DataFrame. It merely gives us a variable to reference each of these components. Let's verify that the DataFrame remains unchanged.

In [None]:
df2

### pandas `Index` type

Both the index and columns are a special type of object named `Index`. This `Index` object is somewhat similar to a Python list. It is a sequence of labels for either the rows or the columns. You will not deal with this object much directly, so we will not go into further details about it here.

### Two-dimensional numpy array

The values are returned as a single two-dimensional numpy array.
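
As a quick check of the three component types, here is a minimal sketch on a hypothetical two-row DataFrame named `df_demo`. It relies only on the `index`, `columns`, and `values` attributes already used above.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a custom index; the values are made up
df_demo = pd.DataFrame(
    {'color': ['red', 'blue'], 'age': [31, 45]},
    index=['Ana', 'Ben'],
)

index = df_demo.index
columns = df_demo.columns
data = df_demo.values

# Both the index and the columns are pandas Index objects
print(isinstance(index, pd.Index), isinstance(columns, pd.Index))

# The data is one two-dimensional numpy array: rows by columns
print(isinstance(data, np.ndarray), data.ndim, data.shape)
```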

### Operating with the DataFrame and not its components

You rarely need to operate with these components directly and instead will be working with the entire DataFrame. But, it is important to understand that they are separate components and you can access them directly if needed.

## Accessing the components of a Series

Similarly, we can access the two Series components - the index and the data. Let's first select a single column from our DataFrame so that we have a Series. When we select a column from the DataFrame as a Series, the index remains the same.

In [None]:
color = df2['color']
color

Let's access the index and the data from the `color` Series without assigning them to separate variables.

In [None]:
color.index

In a Series, the values are stored as a 1-dimensional numpy array.

In [None]:
color.values
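
The sketch below repeats these steps on a hypothetical DataFrame, `df_demo`. A numeric column is used on purpose, since for numeric data `values` is a plain numpy array; other data types may be stored differently depending on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; the values are made up
df_demo = pd.DataFrame(
    {'color': ['red', 'blue', 'red'], 'age': [31, 45, 28]},
    index=['Ana', 'Ben', 'Cai'],
)

age = df_demo['age']

# The selected Series carries the same index as the DataFrame
same_index = age.index.equals(df_demo.index)

# For this numeric column, the values are a one-dimensional numpy array
values = age.values
print(same_index, values.ndim)
```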

### The default index

If you don't specify an index when first reading in a DataFrame, then pandas creates one for you as the sequence of integers beginning at 0. Let's read in the movie dataset and keep the default index.

In [None]:
movie = pd.read_csv('input/movie.csv')
movie.head(3)

### Integers in the index

The integers you see above in the index are the labels for each of the rows. Let's examine the underlying index object.

In [None]:
idx = movie.index
idx

We can also verify its type.

In [None]:
type(idx)

### The RangeIndex

pandas has various types of index objects. A `RangeIndex` is the simplest index and represents the sequence of consecutive integers beginning at 0. It is similar to a Python `range` object in that the values are not actually stored in memory.

### A numpy array underlies the index

The index has a `values` attribute just like the DataFrame. Use it to retrieve the underlying index values as a numpy array.

In [None]:
idx.values

It's not necessary to assign the index to a variable name to access its attributes and methods. You can access it beginning from the DataFrame.

In [None]:
movie.index.values
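
To see the default index without reading a file, the following sketch builds a hypothetical DataFrame, `df_demo`, with no index specified and inspects the resulting `RangeIndex`.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame created without an explicit index
df_demo = pd.DataFrame({'score': [7.9, 6.1, 8.3]})

idx = df_demo.index
is_range = isinstance(idx, pd.RangeIndex)

# Like a Python range, a RangeIndex is described by start, stop, and step
bounds = (idx.start, idx.stop, idx.step)

# The values attribute materializes the labels as a numpy array
arr = idx.values
print(is_range, bounds, arr)
```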

## Setting an index on read

The `read_csv` function provides dozens of parameters that allow us to read in a wide variety of text files. The `index_col` parameter may be used to select a particular column as the index. We can either use the column name or its integer location.

### Reread the movie dataset with the movie title as the index

There's a column in the movie dataset named `title`. Let's reread the data using it as the index.

In [None]:
movie = pd.read_csv('input/movie.csv', index_col='title')
movie.head(3)

Notice that now the titles of each movie serve as the label for each row. Also notice that the word **title** appears directly above the index. This is a bit confusing. The word **title** is NOT a column name. Technically, it is the **name** of the index, but this isn't important at the moment.
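
The sketch below illustrates both forms of `index_col` and the index name. It assumes a tiny in-memory CSV (`csv_text`, invented for illustration) read through `io.StringIO` in place of the real movie file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the movie file
csv_text = "title,year,imdb_score\nAlpha,2001,7.9\nBravo,2005,6.1\n"

# index_col accepts either the column name or its integer location
by_name = pd.read_csv(io.StringIO(csv_text), index_col='title')
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# 'title' is the name of the index, not one of the columns
print(by_name.index.name)
print(list(by_name.columns))
```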

### Access the new index and output its type

Let's access this new index, output its values, and verify that its type is now `Index` instead of `RangeIndex`.

In [None]:
idx2 = movie.index
idx2

In [None]:
type(idx2)

### Select a value from the index

The index is a complex object on its own and has many attributes and methods. The minimum we should know about an index is how to select values from it. We can select single values from an index just like we do with a Python list, by placing the integer location of the item we want within the square brackets. Here, we select the 4th item (integer location 3) from the index.

In [None]:
idx2[3]

We can select this same index label without actually assigning the index to a variable first.

In [None]:
movie.index[3]

### Selection with slice notation

As with Python lists, you can select a range of values using slice notation. Provide the start, stop, and step components of slice notation separated by colons within the brackets.

In [None]:
idx2[100:120:4]

### Selection with a list of integers

You can select multiple individual values with a list of integers. This type of selection does not exist for Python lists.

In [None]:
nums = [1000, 453, 713, 2999]
idx2[nums]
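
The three kinds of index selection can be compared side by side on a small example. This is a sketch with a hypothetical `idx_demo` index; it also shows that a plain Python list rejects a list of integers.

```python
import pandas as pd

# Hypothetical index of made-up titles
idx_demo = pd.Index(['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'])

one = idx_demo[3]          # a single integer returns one label
sliced = idx_demo[0:5:2]   # a slice returns a new Index
picked = idx_demo[[4, 0]]  # a list of integers also returns a new Index

# The same list-based selection on a Python list raises a TypeError
try:
    list(idx_demo)[[4, 0]]
    list_supports_it = True
except TypeError:
    list_supports_it = False

print(one, list(sliced), list(picked), list_supports_it)
```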

## Choosing a good index

Before even considering using one of the columns as an index, know that it's not a necessity. You can complete all of your analysis tasks with just the default `RangeIndex`. The reason the index is covered in this book is that some tasks become easier with a custom index. Also, many other pandas users do analysis with the index, so it's important to understand how it works.

If you do choose to set an index for your DataFrame, I suggest using columns that are both **unique** and **descriptive**. Pandas does not enforce uniqueness for its index, allowing the same value to repeat multiple times. That said, a good index will have unique values to identify each row.

### Verifying uniqueness in the index

The `set_index` method has the ability to verify that all values used for the index are unique by setting the `verify_integrity` parameter to `True`.

In [None]:
movie2 = pd.read_csv('input/movie.csv')
movie2.set_index('title', verify_integrity=True).head(3)

Attempting to set the index to a column with duplicate values will raise an error. The `color` column has only a few unique values and fails to be set as the index when `verify_integrity` is set to `True`.
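
The failure can be reproduced on a small example. The sketch below uses a hypothetical `df_demo` DataFrame whose `color` column contains a repeated value, and catches the `ValueError` that pandas raises.

```python
import pandas as pd

# Hypothetical data: name is unique, color repeats 'red'
df_demo = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cai'],
    'color': ['red', 'blue', 'red'],
})

# A unique column passes the integrity check
unique_ok = df_demo.set_index('name', verify_integrity=True)

# A column with duplicates raises a ValueError
try:
    df_demo.set_index('color', verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

The `is_unique` attribute of a column, for example `df_demo['color'].is_unique`, is another way to check for duplicates before setting the index.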