# Boolean Selection Single Conditions

## Boolean Selection
Boolean Selection, also referred to as **Boolean Indexing**, is the process of selecting subsets of rows from DataFrames (or Series) based on the actual data values and NOT by their labels or integer locations. All of the previous subset selections were done using either labels or integer locations and had nothing to do with the actual values.

### Examples of Boolean Selection

Let's see some examples of actual questions (in plain English) that boolean selection can help us answer from the bikes dataset. The term **query** is used to refer to these sorts of questions.

* Find all rides by males
* Find all rides with duration longer than 2 hours
* Find all rides that took place between March and June of 2015.
* Find all the rides with a duration longer than 2 hours by females with temperature higher than 90 degrees

### All queries have a logical condition
Each of the above queries has a strict logical condition that must be checked one row at a time.

### Keep or discard an entire row of data
If you were to manually answer the above queries, you would need to scan each row and determine whether the row, as a whole, meets the condition. If so, then it is kept in the result, otherwise it is discarded.

### Each row will have a True or False value associated with it
When you perform boolean selection, each row of the DataFrame (or value of a Series) will have a `True` or `False` value associated with it that corresponds to the outcome of the logical condition.

### Begin with a small DataFrame
We will perform our first boolean selection on our sample DataFrame. Let's read it in now.

In [1]:
import pandas as pd
df = pd.read_csv('../veriler/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


## Manually filtering the data
Let's find all the people who are younger than 30 years old. We will do this manually by inspecting the data.

### Create a list of booleans
By inspecting the data, we see that `Niko`, `Aaron`, and `Penelope` are all under 30 years old. To signify which people are under 30, we create a list of 7 boolean values corresponding to the 7 rows in the DataFrame. The values in the list that correspond with the positions of `Niko`, `Aaron`, and `Penelope` are `True`. All other values are `False`. `Niko`, `Aaron`, and `Penelope` are the 2nd, 3rd, and 4th rows, so these are the locations in the list that are `True`.

In [None]:
filt = [False, True, True, True, False, False, False]

### Variable name `filt`
The variable name `filt`, which is just an abbreviation for filter, will be used throughout the book to refer to the sequence of booleans. You are free to use any variable name you like for the sequence of booleans, but being consistent makes your code easier to understand.

### Pass this list into just the brackets
The above list has `True` in the 2nd, 3rd, and 4th positions. These will be the rows that are kept in the result during boolean selection. Place the list inside the brackets to complete the selection.

In [None]:
df[filt]
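
For readers who do not have the CSV file at hand, here is a minimal sketch of the same selection. It assumes the sample DataFrame can be rebuilt inline from the values shown in the table above; the inline construction is only a stand-in for `pd.read_csv`.

```python
import pandas as pd

# Rebuild the sample DataFrame inline (values copied from the table above)
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

# One boolean per row; True marks the rows to keep
filt = [False, True, True, True, False, False, False]
result = df[filt]

print(result.index.tolist())  # ['Niko', 'Aaron', 'Penelope']
```

Only the rows whose position in the list holds `True` survive, and every column of those rows is kept.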

### Wait a second… Isn’t `[ ]` just for column selection?

The primary purpose of *just the brackets* for a DataFrame is to select one or more columns by using either a string or a list of strings. All of a sudden, this example shows entire rows being selected with boolean values. This is what makes pandas, unfortunately, a confusing library to learn and use.

## Operator Overloading
*Just the brackets* is **overloaded**. Depending on the inputs, pandas will do something completely different. Here are the rules for the different objects passed to *just the brackets*.

* **string** — return a column as a Series
* **list of strings** — return all those columns as a DataFrame
* **sequence of booleans** — select all rows where `True`
* **slice** — select rows (can do both label and integer location — confusing!) I never do this as it is ambiguous. This will be covered in the **Miscellaneous Subset Selection** chapter.

In summary, *just the brackets* primarily selects columns, but if you pass it a sequence of booleans it will select all rows that are `True`.
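
The first three rules above can be sketched on a tiny DataFrame. The three-row frame below is made up for illustration and is not one of the book's datasets.

```python
import pandas as pd

# Small illustrative DataFrame (a subset of the sample data's columns)
df = pd.DataFrame(
    {"state": ["NY", "TX", "FL"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

one_column = df["age"]               # string -> one column as a Series
two_columns = df[["state", "age"]]   # list of strings -> columns as a DataFrame
some_rows = df[[False, True, True]]  # sequence of booleans -> rows where True

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(some_rows.index.tolist())    # ['Niko', 'Aaron']
```

The same pair of brackets returns a column, several columns, or a subset of rows depending only on the type of object placed inside them.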

### Using booleans in a Series and not a list
Instead of a list, we can use a Series of booleans, which will make the same selection. Below, we use the Series constructor to create a Series object. The Series must have the same index as the DataFrame it is selecting from in order to work properly, so we create it with the same index as the original DataFrame. This automatic alignment of the index is important and covered in the **Joining Data** part.

In [None]:
filt = pd.Series([False, True, True, True, False, False, False], 
                 index=df.index)
filt

### Use the boolean Series to do the boolean selection
Placing the Series directly in the brackets will again select only the rows that correspond with the `True` values in the Series.

In [None]:
df[filt]
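
To illustrate why the index of the boolean Series matters, the sketch below uses a small made-up DataFrame and a boolean Series whose labels appear in a different order. The selection follows the labels, not the positions. Recent pandas versions typically emit a `UserWarning` saying the boolean key is being reindexed when the two indexes are not identical.

```python
import pandas as pd

# Small illustrative DataFrame (ages taken from the sample data)
df = pd.DataFrame({"age": [30, 2, 12]}, index=["Jane", "Niko", "Aaron"])

# Same labels as df, but listed in a different order
filt = pd.Series([True, True, False], index=["Niko", "Aaron", "Jane"])

# pandas aligns the booleans to df's index by label before selecting,
# so Niko and Aaron are kept even though True sits in positions 0 and 1
result = df[filt]

print(result.index.tolist())  # ['Niko', 'Aaron']
```

Had the booleans been applied purely by position, `Jane` and `Niko` would have been selected instead.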

## Practical Boolean Selection
We will almost never create boolean lists/Series manually like we did above but instead use the actual data to create them.

### Creating boolean Series from column data
By far the most common way to create a boolean Series will be from the values of one particular column. We will test a condition using one of the six comparison operators:

* `<`
* `<=`
* `>`
* `>=`
* `==`
* `!=`
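
As a quick illustration, each of the six operators can be applied to a small Series. The three ages below are taken from the sample DataFrame, and the threshold of 12 is an arbitrary choice.

```python
import pandas as pd

age = pd.Series([30, 2, 12], index=["Jane", "Niko", "Aaron"], name="age")

# Each comparison returns a boolean Series with the same index as `age`
comparisons = pd.DataFrame(
    {
        "age < 12": age < 12,
        "age <= 12": age <= 12,
        "age > 12": age > 12,
        "age >= 12": age >= 12,
        "age == 12": age == 12,
        "age != 12": age != 12,
    }
)

print(comparisons)
```

Reading across the `Aaron` row (age 12) shows how the strict and non-strict operators differ at the boundary value.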


Let's begin by reading in the bikes dataset.

In [2]:
bikes = pd.read_csv('../veriler/bikes.csv')
bikes.head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy


### Create a boolean Series
Let's create a boolean Series by determining which rows have a trip duration greater than 1000 seconds. To make the comparison, we select the `tripduration` column as a Series and compare it against the integer 1000.

In [3]:
filt = bikes['tripduration'] > 1000
filt.head(3)

0    False
1    False
2     True
Name: tripduration, dtype: bool

When we write `bikes['tripduration'] > 1000`, pandas compares each value in the `tripduration` column against 1000. It returns a new Series the same length as `tripduration` with boolean values corresponding to the outcome of the comparison. Let's verify that the `filt` Series is the same length as the DataFrame.

In [4]:
len(filt)

50089

In [5]:
len(bikes)

50089

### Manually verify correctness
Look at the `tripduration` values in the first three rows of the bikes DataFrame displayed above (993, 623, and 1040) to manually verify that, among those rows, only the third satisfies the condition.

### Complete the boolean selection
We can now pass the `filt` boolean Series into *just the brackets* to filter the entire DataFrame to return all the rows in the bikes dataset that have a trip duration greater than 1000. Verify that all `tripduration` values are greater than 1000.

In [6]:
bikes[filt].head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy


### How many rows have a trip duration greater than 1000?
To answer this question, let's assign the result of the boolean selection to a variable and then compare the number of rows between it and the original DataFrame.

In [7]:
bikes_duration_1000 = bikes[filt]

Let's find the number of rows in each DataFrame.

In [8]:
len(bikes)

50089

In [9]:
len(bikes_duration_1000)

10178

Dividing the two lengths shows that about 20% of the rides are longer than 1,000 seconds.

In [10]:
len(bikes_duration_1000) / len(bikes)

0.20319830701351593
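
The same count and proportion can also be read directly from the boolean Series, because `True` is treated as 1 and `False` as 0 in arithmetic. The sketch below uses a handful of durations chosen for illustration rather than the full bikes dataset.

```python
import pandas as pd

# A few trip durations in seconds, chosen for illustration
tripduration = pd.Series([993, 623, 1040, 1300, 565])
filt = tripduration > 1000

n_true = filt.sum()       # number of True values
share_true = filt.mean()  # fraction of True values

print(n_true)      # 2
print(share_true)  # 0.4
```

This avoids creating an intermediate DataFrame when only the count or the share of matching rows is needed.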

## Boolean selection in one line
Often, you will see boolean selection completed in a single line of code instead of the two lines we used above. The expression for the filter is placed directly inside the brackets. Although this method will save a line of code, I recommend assigning the filter as a separate variable to help with readability.

In [11]:
bikes[bikes['tripduration'] > 1000].head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy


## Single condition expression
Our first example tested a single condition (whether the trip duration was greater than 1,000). Let's test a different single condition and find all the rides that happened when the weather was cloudy. We use the `==` operator to test for equality and again pass the resulting variable to the brackets, which completes our selection.

In [12]:
filt = bikes['events'] == 'cloudy'
bikes[filt].head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
6,18880,Subscriber,Male,2013-07-02 17:47:00,2013-07-02 17:56:00,565,Clark St & Randolph St,41.884576,-87.63189,31.0,Ravenswood Ave & Irving Park Rd,41.95469,-87.67393,19.0,66.0,10.0,15.0,-9999.0,cloudy
7,19689,Subscriber,Male,2013-07-03 09:07:00,2013-07-03 09:16:00,505,State St & Van Buren St,41.877181,-87.627844,27.0,Franklin St & Jackson Blvd,41.877708,-87.635321,27.0,64.0,7.0,5.8,-9999.0,cloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy


## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as `actor1`. How many of these movies has he starred in?</span>

In [15]:
movie = pd.read_csv('../veriler/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


In [16]:
filt = movie['actor1'] == 'Tom Hanks'
movie[filt]

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Toy Story 3,2010.0,Color,G,103.0,Lee Unkrich,125.0,Tom Hanks,15000.0,John Ratzenberger,1000.0,...,721.0,414984497.0,Adventure|Animation|Comedy|Family|Fantasy,453.0,544884,college|day care|escape|teddy bear|toy,English,USA,200000000.0,8.3
The Polar Express,2004.0,Color,G,100.0,Robert Zemeckis,0.0,Tom Hanks,15000.0,Eddie Deezen,726.0,...,267.0,665426.0,Adventure|Animation|Family|Fantasy,188.0,120798,boy|christmas|christmas eve|north pole|train,English,USA,165000000.0,6.6
Angels & Demons,2009.0,Color,PG-13,146.0,Ron Howard,2000.0,Tom Hanks,15000.0,Ayelet Zurer,745.0,...,294.0,133375846.0,Mystery|Thriller,298.0,207839,conclave|illuminati|murder|reference to bernin...,English,USA,150000000.0,6.7
The Da Vinci Code,2006.0,Color,PG-13,174.0,Ron Howard,2000.0,Tom Hanks,15000.0,Seth Gabel,574.0,...,362.0,217536138.0,Mystery|Thriller,294.0,314253,based on supposedly true story|holy grail|mary...,English,USA,125000000.0,6.6
Cloud Atlas,2012.0,Color,R,172.0,Tom Tykwer,670.0,Tom Hanks,15000.0,Jim Sturgess,5000.0,...,1000.0,27098580.0,Drama|Sci-Fi,511.0,284825,composer|future|letter|nonlinear timeline|nurs...,English,Germany,102000000.0,7.5
Toy Story 2,1999.0,Color,G,82.0,John Lasseter,487.0,Tom Hanks,15000.0,John Ratzenberger,1000.0,...,967.0,245823397.0,Adventure|Animation|Comedy|Family|Fantasy,191.0,385871,collector|dog|friend|rescue|toy,English,USA,90000000.0,7.9
Cast Away,2000.0,Color,PG-13,143.0,Robert Zemeckis,0.0,Tom Hanks,15000.0,Paul Sanchez,410.0,...,272.0,233630478.0,Adventure|Drama|Romance,221.0,394317,christmas|island|love|survival|talking to inan...,English,USA,90000000.0,7.7
Road to Perdition,2002.0,Color,R,117.0,Sam Mendes,0.0,Tom Hanks,15000.0,Jennifer Jason Leigh,1000.0,...,818.0,104054514.0,Crime|Drama|Thriller,226.0,200359,1930s|blood|gun|on the run|revenge,English,USA,80000000.0,7.7
The Terminal,2004.0,Color,PG-13,128.0,Steven Spielberg,14000.0,Tom Hanks,15000.0,Chi McBride,466.0,...,232.0,77032279.0,Comedy|Drama,151.0,303864,airport|construction site|fish out of water|fl...,English,USA,60000000.0,7.3
Saving Private Ryan,1998.0,Color,R,169.0,Steven Spielberg,14000.0,Tom Hanks,15000.0,Vin Diesel,14000.0,...,13000.0,216119491.0,Action|Drama|War,219.0,881236,army|invasion|killed in action|normandy|soldier,English,USA,70000000.0,8.6


### Exercise 2
<span  style="color:green; font-size:16px">Select movies with an IMDB score greater than 9.</span>

In [18]:
# movie[movie['imdb_score'] > 9]  # one-line version; this is how I first did it
filt = movie['imdb_score'] > 9
movie[filt]

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
The Shawshank Redemption,1994.0,Color,R,142.0,Frank Darabont,0.0,Morgan Freeman,11000.0,Jeffrey DeMunn,745.0,...,461.0,28341469.0,Crime|Drama,199.0,1689764,escape from prison|first person narration|pris...,English,USA,25000000.0,9.3
Towering Inferno,,Color,,65.0,John Blanchard,0.0,Martin Short,770.0,Andrea Martin,179.0,...,176.0,,Comedy,,10,,English,Canada,,9.5
Dekalog,,Color,TV-MA,55.0,,,Krystyna Janda,20.0,Olaf Lubaszenko,3.0,...,2.0,447093.0,Drama,53.0,12590,meaning of life|moral challenge|morality|searc...,Polish,Poland,,9.1
The Godfather,1972.0,Color,R,175.0,Francis Ford Coppola,0.0,Al Pacino,14000.0,Marlon Brando,10000.0,...,3000.0,134821952.0,Crime|Drama,208.0,1155770,crime family|mafia|organized crime|patriarch|r...,English,USA,6000000.0,9.2
Kickboxer: Vengeance,2016.0,,,90.0,John Stockwell,134.0,Matthew Ziff,260000.0,T.J. Storm,454.0,...,354.0,,Action,2.0,246,,,USA,17000000.0,9.1


### Exercise 3
<span  style="color:green; font-size:16px">Write a function that accepts a single parameter to find the number of movies for a given content rating. Use the function to find the number of movies for ratings 'R', 'PG-13', and 'PG'.</span>
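
One possible approach is sketched below. It assumes a tiny stand-in DataFrame with made-up rows, so the counts it prints are not the answers for the real movie dataset. The function name `count_rating` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for the movie DataFrame (hypothetical rows, not the real data)
movie = pd.DataFrame(
    {"content_rating": ["PG-13", "R", "PG", "R", "G", "PG-13", "R"]},
    index=["A", "B", "C", "D", "E", "F", "G"],
)

def count_rating(rating):
    """Return the number of movies whose content_rating equals `rating`."""
    filt = movie["content_rating"] == rating
    return len(movie[filt])

for rating in ["R", "PG-13", "PG"]:
    print(rating, count_rating(rating))
```

With the real data, the same function would be called after reading `movie.csv` as in Exercise 1.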