# Selecting Rows

### Introduction

In this lesson, we'll see how to select rows of a dataframe based on the values they contain.

### Selecting by rows

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [None]:
movies = [
    {'index': 1, 'title': 'Shazam!', 'release_date': 1553299200, 'genre': 'Comedy'},
    {'index': 2, 'title': 'Captain Marvel', 'release_date': 1551830400, 'genre': 'Action'},
    {'index': 3, 'title': 'Escape Room', 'release_date': 1546473600, 'genre': 'Horror'},
    {'index': 4, 'title': 'How to Train A Dragon', 'release_date': 1546473600, 'genre': 'Animation'},
]

In [None]:
movies_df = spark.createDataFrame(movies)

In [None]:
movies_df.show()

+---------+-----+------------+--------------------+
|    genre|index|release_date|               title|
+---------+-----+------------+--------------------+
|   Comedy|    1|  1553299200|             Shazam!|
|   Action|    2|  1551830400|      Captain Marvel|
|   Horror|    3|  1546473600|         Escape Room|
|Animation|    4|  1546473600|How to Train A Dr...|
+---------+-----+------------+--------------------+



To select a row, say the movie with an index of 1, we `filter` the dataframe for rows whose `index` column equals 1.  Here's how:

In [None]:
movies_df[movies_df['index'] == 1].show()

+------+-----+------------+-------+
| genre|index|release_date|  title|
+------+-----+------------+-------+
|Comedy|    1|  1553299200|Shazam!|
+------+-----+------------+-------+



### Breaking it down in pandas

Why the above works in PySpark is a little hard to see.  It's actually easier if we briefly switch over to pandas.

> Pandas is a Python library that also allows us to easily create dataframes.  But unlike PySpark, it does not work with distributed datasets.

The operation to select a row is quite similar.  We first import the pandas library and create the dataframe.

In [None]:
import pandas as pd

df = pd.DataFrame(movies)
df

   index                  title  release_date      genre
0      1                Shazam!    1553299200     Comedy
1      2         Captain Marvel    1551830400     Action
2      3            Escape Room    1546473600     Horror
3      4  How to Train A Dragon    1546473600  Animation


And then we select the rows whose index is 1.

In [None]:
df[df['index'] == 1]

   index    title  release_date   genre
0      1  Shazam!    1553299200  Comedy


So this is essentially the same syntax that we used to select rows in PySpark.

```python
movies_df[movies_df['index'] == 1].show()
```

Ok, now let's explain why the pandas statement `df[df['index'] == 1]` works.  The key is understanding the part inside the square brackets.

In [None]:
df['index'] == 1

0     True
1    False
2    False
3    False
Name: index, dtype: bool

Notice that this returns a column of `True` or `False` values, one per row, based on whether that row's `index` value equals 1.  This is why only the first record is `True`.
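To make that intermediate step concrete, here is a small sketch that stores the comparison in a variable and inspects it. The names `sample` and `mask` are illustrative, and the dataframe is a trimmed copy of the movies data.

```python
import pandas as pd

# A trimmed copy of the movies data.
sample = pd.DataFrame({
    "index": [1, 2, 3, 4],
    "title": ["Shazam!", "Captain Marvel", "Escape Room", "How to Train A Dragon"],
})

# The comparison is applied to every row and yields a boolean Series.
mask = sample["index"] == 1

print(mask.tolist())    # [True, False, False, False]
print(int(mask.sum()))  # 1, the number of rows that would be kept
```

Because `True` counts as 1 when summed, `mask.sum()` is a quick way to check how many rows a condition matches before filtering.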

Then we pass this column of `True` or `False` values to the dataframe, and pandas returns each row where the column's entry is `True`.

In [None]:
df[df['index'] == 1]

   index    title  release_date   genre
0      1  Shazam!    1553299200  Comedy


So we can imagine the above as passing `True, False, False, False` to the dataframe, to only display the first row above.

In [None]:
bool_values = [True, False, False, False]

df[bool_values]

   index    title  release_date   genre
0      1  Shazam!    1553299200  Comedy
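Conceptually, passing `bool_values` to the dataframe pairs each row with its flag and keeps the rows flagged `True`. The plain-Python sketch below mimics that idea on a trimmed list of dictionaries. It is only an illustration of the concept, not how pandas implements it internally, and the name `kept` is hypothetical.

```python
# A trimmed copy of the movies data as plain dictionaries.
movies = [
    {"index": 1, "title": "Shazam!"},
    {"index": 2, "title": "Captain Marvel"},
    {"index": 3, "title": "Escape Room"},
    {"index": 4, "title": "How to Train A Dragon"},
]

bool_values = [True, False, False, False]

# Pair each row with its flag and keep only the rows flagged True.
kept = [movie for movie, keep in zip(movies, bool_values) if keep]
print(kept)  # [{'index': 1, 'title': 'Shazam!'}]
```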


### Moving to PySpark

So selecting rows in PySpark essentially works the same way.  We start with our dataframe.

In [None]:
movies_df.show()

+---------+-----+------------+--------------------+
|    genre|index|release_date|               title|
+---------+-----+------------+--------------------+
|   Comedy|    1|  1553299200|             Shazam!|
|   Action|    2|  1551830400|      Captain Marvel|
|   Horror|    3|  1546473600|         Escape Room|
|Animation|    4|  1546473600|How to Train A Dr...|
+---------+-----+------------+--------------------+



And then we can filter for rows where the index is 1 with the following.

In [None]:
movies_df[movies_df['index'] == 1].show()

+------+-----+------------+-------+
| genre|index|release_date|  title|
+------+-----+------------+-------+
|Comedy|    1|  1553299200|Shazam!|
+------+-----+------------+-------+



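We can imagine the expression inside the brackets as producing a True or False value for each row of the dataframe.  Unlike pandas, though, PySpark does not compute those values right away: the expression evaluates to a `Column` object that describes the condition, and Spark applies it when the filter runs.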

In [None]:
movies_df['index'] == 1

Column<'(index = 1)'>

### Summary

In this lesson, we learned how to select certain rows in a PySpark dataframe.  As we saw, we do so with something like the following: 

In [None]:
movies_df[movies_df['index'] == 1].show()

+------+-----+------------+-------+
| genre|index|release_date|  title|
+------+-----+------------+-------+
|Comedy|    1|  1553299200|Shazam!|
+------+-----+------------+-------+



And the key part is understanding the component in between the brackets.

In [None]:
movies_df['index'] == 1

Column<'(index = 1)'>

We can imagine the code above returning a True or False value for each row, based on whether the value in the index column equals 1.

And then we pass this expression into our dataframe to return the records where it is True.

In [None]:
movies_df[movies_df['index'] == 1].show()

+------+-----+------------+-------+
| genre|index|release_date|  title|
+------+-----+------------+-------+
|Comedy|    1|  1553299200|Shazam!|
+------+-----+------------+-------+



### Resources

[Gitbook Understanding Spark](https://mallikarjuna_g.gitbooks.io/spark/content/spark-overview.html)