# 4.2 *Row* Operations: Selecting and Subsetting/Filtering
- [Selecting Rows: Using loc](#Selecting-Rows:-Using-loc)   
  - [If the Dataframe Has No Index](#Selecting-Rows-with-loc:-If-the-Dataframe-Has-No-Index)  
  - [Using the Dataframe Index](#Selecting-Rows-with-loc:-Using-the-Dataframe-Index) 


- [Selecting Rows: Using *iloc*](#Selecting-Rows-with-iloc)  


- [Selecting Rows with Boolean Filters](#Selecting-Rows-with-Boolean-Filters) (**Probably our most used!**)  
  - Using the AND Operator  
  - Using the OR Operator


- [Selecting Rows By Field Content](#Selecting-Rows-By-Field-Content)
  
 
- [Removing Rows](#Removing-Rows)   

- [Making the First Row the Column Names](#Making-the-First-Row-the-Column-Names)  


- [Sorting Rows](#Sorting-Rows)

In [1]:
import pandas as pd 

In [2]:
# Read the csv file into a pandas dataframe
df = pd.read_csv('Data/Data_Olympics.csv')

# Display the first three records/rows in the dataframe
df.head(3)

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70


In [23]:
df.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86], dtype=int64)

# Selecting Rows: Using *loc* 
- The two main approaches to selecting rows in pandas are **loc** and **iloc**:    
  - **loc** uses the **index** label for a row (when no index has been set, the labels are the default row numbers 0, 1, 2, ...)  
  - **iloc** uses the row number **position** (starting with 0!)  
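
The difference between the two is easiest to see when the index is not numeric. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are assumptions for illustration, not the Olympics file):

```python
import pandas as pd

# Hypothetical toy data with string labels as the index
toy = pd.DataFrame(
    {"Gold": [46, 27, 26]},
    index=["USA", "GBR", "CHN"],
)

by_label = toy.loc["GBR"]    # the row whose index label is "GBR"
by_position = toy.iloc[1]    # the row at position 1 (the second row)

print(by_label["Gold"], by_position["Gold"])
```

Both lookups land on the same row here; they only differ in how the row is addressed.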

## Selecting Rows with *loc*: If the Dataframe Has No Index

In [9]:
df.head(6)

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70
3,4,Russia (RUS),19,17,19,55
4,5,Germany (GER),17,10,15,42
5,6,Japan (JPN),12,8,21,41


In [24]:
# Selecting a single row with loc: no index, default row number
df.loc[0] # The first row of the dataframe 

Rank                         1
Country    United States (USA)
Gold                        46
Silver                      37
Bronze                      38
Total                      121
Name: 0, dtype: object

In [25]:
df.loc[1] # The second row of the dataframe 

Rank                         2
Country    Great Britain (GBR)
Gold                        27
Silver                      23
Bronze                      17
Total                       67
Name: 1, dtype: object

In [31]:
df.loc[1: ].head() # All rows but the first 

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70
3,4,Russia (RUS),19,17,19,55
4,5,Germany (GER),17,10,15,42
5,6,Japan (JPN),12,8,21,41


In [32]:
df.loc[0:4] # The first row through the fifth row (loc includes the end label)

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70
3,4,Russia (RUS),19,17,19,55
4,5,Germany (GER),17,10,15,42
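
A slice with loc includes its end label, while a slice with iloc follows the usual Python rule and excludes the end position. A small sketch of the difference, assuming a made-up dataframe `toy` with the default row numbers:

```python
import pandas as pd

# Hypothetical toy data; the default index is 0..5
toy = pd.DataFrame({"Rank": [1, 2, 3, 4, 5, 6]})

label_slice = toy.loc[0:4]      # labels 0 through 4, end label included
position_slice = toy.iloc[0:4]  # positions 0 through 3, end position excluded

print(len(label_slice), len(position_slice))
```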


## Selecting Rows with *loc*: Using the Dataframe Index  
To select rows with loc by a label other than the default row number, the Dataframe must first be given an Index!

In [33]:
# No Index has been set for this Dataframe (only the default row numbers)
df.head(3)

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70


In [37]:
# Create an Index:  Set it to be the Country Column
df_new = df.set_index('Country')
df_new.head(10)

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41
France (FRA),7,10,18,14,42
South Korea (KOR),8,9,3,9,21
Italy (ITA),9,8,12,8,28
Australia (AUS),10,8,11,10,29
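
Note that set_index returns a new dataframe and leaves the original untouched, which is why the result is assigned to a new name. A minimal sketch on made-up data (`toy` is an assumption for illustration), including reset_index to undo the change:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Country": ["USA", "GBR"], "Gold": [46, 27]})

indexed = toy.set_index("Country")  # new dataframe; toy is unchanged
restored = indexed.reset_index()    # moves the index back into a regular column

print(list(indexed.index))
print(list(restored.columns))
```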


In [38]:
# Display all of the values for the new Dataframe Index
df_new.index.values

array(['United States (USA)', 'Great Britain (GBR)', 'China (CHN)',
       'Russia (RUS)', 'Germany (GER)', 'Japan (JPN)', 'France (FRA)',
       'South Korea (KOR)', 'Italy (ITA)', 'Australia (AUS)',
       'Netherlands (NED)', 'Hungary (HUN)', 'Brazil (BRA)*',
       'Spain (ESP)', 'Kenya (KEN)', 'Jamaica (JAM)', 'Croatia (CRO)',
       'Cuba (CUB)', 'New Zealand (NZL)', 'Canada (CAN)',
       'Uzbekistan (UZB)', 'Kazakhstan (KAZ)', 'Colombia (COL)',
       'Switzerland (SUI)', 'Iran (IRI)', 'Greece (GRE)',
       'Argentina (ARG)', 'Denmark (DEN)', 'Sweden (SWE)',
       'South Africa (RSA)', 'Ukraine (UKR)', 'Serbia (SRB)',
       'Poland (POL)', 'North Korea (PRK)', 'Belgium (BEL)',
       'Thailand (THA)', 'Slovakia (SVK)', 'Georgia (GEO)',
       'Azerbaijan (AZE)', 'Belarus (BLR)', 'Turkey (TUR)',
       'Armenia (ARM)', 'Czech Republic (CZE)', 'Ethiopia (ETH)',
       'Slovenia (SLO)', 'Indonesia (INA)', 'Romania (ROU)',
       'Bahrain (BRN)', 'Vietnam (VIE)', 'Chinese Taipei

In [40]:
# Use loc with the Country index
df_new.loc['United States (USA)']

Rank        1
Gold       46
Silver     37
Bronze     38
Total     121
Name: United States (USA), dtype: int64

In [41]:
df_new.head(6)

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41


In [44]:
# China and everything after
df_new.loc['China (CHN)':].head()

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41
France (FRA),7,10,18,14,42


In [43]:
df_new.head(10)

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41
France (FRA),7,10,18,14,42
South Korea (KOR),8,9,3,9,21
Italy (ITA),9,8,12,8,28
Australia (AUS),10,8,11,10,29


In [45]:
# Slicing from the beginning through France
df_new.loc[:'France (FRA)']

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41
France (FRA),7,10,18,14,42


In [46]:
# Slicing from France to the end
df_new.loc['France (FRA)': ].head()

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
France (FRA),7,10,18,14,42
South Korea (KOR),8,9,3,9,21
Italy (ITA),9,8,12,8,28
Australia (AUS),10,8,11,10,29
Netherlands (NED),11,8,7,4,19


## Selecting Rows with *iloc*  
*iloc* uses the row number position (starting with 0!)

In [47]:
df_new.head(3)

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70


In [48]:
# Selecting the first row of the dataframe
df_new.iloc[0] 

Rank        1
Gold       46
Silver     37
Bronze     38
Total     121
Name: United States (USA), dtype: int64

In [49]:
# Select the second row of the dataframe 
df_new.iloc[1] 

Rank       2
Gold      27
Silver    23
Bronze    17
Total     67
Name: Great Britain (GBR), dtype: int64

In [51]:
# Select All rows but the first row
df.iloc[1:].head() 

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70
3,4,Russia (RUS),19,17,19,55
4,5,Germany (GER),17,10,15,42
5,6,Japan (JPN),12,8,21,41
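
Because iloc works on positions, it also accepts negative positions and lists of positions. A short sketch, assuming a made-up dataframe `toy`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"Gold": [46, 27, 26, 19]},
    index=["USA", "GBR", "CHN", "RUS"],
)

last_row = toy.iloc[-1]     # negative positions count from the end
picked = toy.iloc[[0, 2]]   # a list of positions returns a dataframe
first_two = toy.iloc[:2]    # positions 0 and 1 (end position excluded)

print(last_row["Gold"], list(picked.index), list(first_two.index))
```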


# Selecting Rows with Boolean Filters

In [52]:
df_new.head()

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42


In [56]:
# Create a row filter:  Top five for Rank
row_filter = df_new['Rank'] <= 5

# Use the row filter to subset the dataframe
df_top_five = df_new[ row_filter ]

# Display the results
df_top_five.head()

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70
Russia (RUS),4,19,17,19,55
Germany (GER),5,17,10,15,42


In [58]:
# Create a row filter:  Only countries with at least 20 Gold Medals
row_filter = df_new['Gold'] >= 20

df_new[ row_filter ].head()

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States (USA),1,46,37,38,121
Great Britain (GBR),2,27,23,17,67
China (CHN),3,26,18,26,70


In [64]:
# Using the AND operator: Ranked between 5 and 10 (inclusive)
row_filter1 = df_new['Rank'] >= 5
row_filter2 = df_new['Rank'] <= 10

df_2nd_five = df_new[row_filter1 & row_filter2]
df_2nd_five

Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41
France (FRA),7,10,18,14,42
South Korea (KOR),8,9,3,9,21
Italy (ITA),9,8,12,8,28
Australia (AUS),10,8,11,10,29


In [66]:
# Using the OR operator: Either ranked 5 or 6
row_filter1 = df_new['Rank'] == 5
row_filter2 = df_new['Rank'] == 6

df_five_or_six = df_new[row_filter1 | row_filter2]
df_five_or_six.head()


Unnamed: 0_level_0,Rank,Gold,Silver,Bronze,Total
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Germany (GER),5,17,10,15,42
Japan (JPN),6,12,8,21,41
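
The filters can also be written inline instead of being stored in variables first. The sketch below uses a made-up dataframe `toy`; the one thing to watch is that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>=` or `==`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Rank": [4, 5, 6, 7], "Gold": [19, 17, 12, 10]})

# AND: both conditions must hold
both = toy[(toy["Rank"] >= 5) & (toy["Gold"] >= 12)]

# OR: either condition may hold
either = toy[(toy["Rank"] == 5) | (toy["Rank"] == 6)]

# isin is a compact alternative to chaining == with |
same_as_either = toy[toy["Rank"].isin([5, 6])]

# ~ negates a filter
not_five_or_six = toy[~toy["Rank"].isin([5, 6])]
```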


# Selecting Rows By Field Content

In [None]:
df.head()

In [67]:
df_ia_index = df['Country'].str.contains('ia')
df_ia = df[df_ia_index]
df_ia.head()

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
3,4,Russia (RUS),19,17,19,55
9,10,Australia (AUS),8,11,10,29
16,17,Croatia (CRO),5,3,2,10
22,23,Colombia (COL),3,2,3,8
31,32,Serbia (SRB),2,4,2,8
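
By default str.contains is case-sensitive, treats the pattern as a regular expression, and returns a missing value for rows where the field itself is missing. The following sketch shows the options that control this, using a made-up dataframe `toy` (its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, including one missing value
toy = pd.DataFrame(
    {"Country": ["Russia (RUS)", "INDIA (IND)", None, "Japan (JPN)"]}
)

# na=False treats missing values as "no match", so the mask is purely boolean
case_sensitive = toy[toy["Country"].str.contains("ia", na=False)]

# case=False also matches "IA"
case_insensitive = toy[toy["Country"].str.contains("ia", case=False, na=False)]

# regex=False treats the pattern as literal text, useful for characters like "("
literal = toy[toy["Country"].str.contains("(RUS)", regex=False, na=False)]
```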


# Removing Rows

In [68]:
#Read the CSV file
df_raw = pd.read_csv('Data/Data_RowOperations.csv')
df_raw.head()

Unnamed: 0,Store Data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,0,,,,,,,
1,1,OrderID,Order Date,Product,Size (US),Colour,Price ($),Store
2,2,1,42006,Boots,7,Gold,25,California
3,3,2,42006,Boots,4,Black,25,New York
4,4,3,42006,Boots,6,Red,25,Georgia


In [71]:
# Delete the first row by keeping everything but the first row!
df_cleaned = df_raw.iloc[1:]
df_cleaned.head()

Unnamed: 0,Store Data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
1,1,OrderID,Order Date,Product,Size (US),Colour,Price ($),Store
2,2,1,42006,Boots,7,Gold,25,California
3,3,2,42006,Boots,4,Black,25,New York
4,4,3,42006,Boots,6,Red,25,Georgia
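
Another option is to remove rows by their index label with drop. This is only a sketch on a made-up dataframe `toy`, not the store data file:

```python
import pandas as pd

# Hypothetical toy data; the first row stands in for an unwanted row
toy = pd.DataFrame({"Product": ["(blank)", "Boots", "Boots"], "Price": [0, 25, 25]})

without_first = toy.drop(index=0)     # drop the row labelled 0
without_two = toy.drop(index=[0, 2])  # drop several labels at once

# drop returns a new dataframe; toy itself still has all three rows
print(len(toy), len(without_first), len(without_two))
```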


# Making the First Row the Column Names

In [72]:
df_cleaned.columns 

Index(['Store Data', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4',
       'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7'],
      dtype='object')

In [77]:
# Set the column names to the values in the first row
df_cleaned.columns = df_cleaned.iloc[0]
df_cleaned.head()

1,1.1,OrderID,Order Date,Product,Size (US),Colour,Price ($),Store
1,1,OrderID,Order Date,Product,Size (US),Colour,Price ($),Store
2,2,1,42006,Boots,7,Gold,25,California
3,3,2,42006,Boots,4,Black,25,New York
4,4,3,42006,Boots,6,Red,25,Georgia


In [78]:
# Now Delete the first row
df_final = df_cleaned.iloc[1:]
df_final.head()

1,1.1,OrderID,Order Date,Product,Size (US),Colour,Price ($),Store
2,2,1,42006,Boots,7,Gold,25,California
3,3,2,42006,Boots,4,Black,25,New York
4,4,3,42006,Boots,6,Red,25,Georgia
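
The same steps can be sketched end to end on a small made-up dataframe (`raw` and its values are assumptions for illustration). Assigning a row directly as the column names carries the row's label along as the name of the column index; converting the row to a plain list avoids that, and reset_index renumbers the remaining rows from 0:

```python
import pandas as pd

# Hypothetical toy data: the real column names sit in the first row
raw = pd.DataFrame(
    [["OrderID", "Product", "Price ($)"],
     ["1", "Boots", "25"],
     ["2", "Boots", "25"]],
    columns=["Store Data", "Unnamed: 1", "Unnamed: 2"],
)

tidy = raw.iloc[1:].copy()           # keep the data rows as an independent dataframe
tidy.columns = raw.iloc[0].tolist()  # plain list of names, so no row label is carried over
tidy = tidy.reset_index(drop=True)   # renumber the rows from 0

print(list(tidy.columns))
```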


# Sorting Rows

In [79]:
# Reread the csv file into a pandas dataframe to start fresh
df = pd.read_csv('Data/Data_Olympics.csv')
df.head(10)

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70
3,4,Russia (RUS),19,17,19,55
4,5,Germany (GER),17,10,15,42
5,6,Japan (JPN),12,8,21,41
6,7,France (FRA),10,18,14,42
7,8,South Korea (KOR),9,3,9,21
8,9,Italy (ITA),8,12,8,28
9,10,Australia (AUS),8,11,10,29


### Sorting by One Column 

In [80]:
# Sort (existing dataframe) by Number of Gold medals, Descending
# Note:  inplace=True changes df itself instead of returning a sorted copy
df.sort_values('Gold', inplace=True, ascending=False)
df.head()

Unnamed: 0,Rank,Country,Gold,Silver,Bronze,Total
0,1,United States (USA),46,37,38,121
1,2,Great Britain (GBR),27,23,17,67
2,3,China (CHN),26,18,26,70
3,4,Russia (RUS),19,17,19,55
4,5,Germany (GER),17,10,15,42


### Sorting by Multiple Columns  

In [81]:
#Read the CSV file
df_countries = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv')
df_countries.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [82]:
# Sort df_countries by Continent (Ascending) and then Country (Descending)
df_countries.sort_values(['continent', 'country'], inplace=True, ascending=[True, False])

df_countries.head(5)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
192,Zimbabwe,64,18,4,4.7,Africa
191,Zambia,32,19,4,2.5,Africa
179,Uganda,45,9,0,8.3,Africa
175,Tunisia,51,3,20,1.3,Africa
172,Togo,36,2,19,1.3,Africa


In [83]:
# Create a new dataframe (df_continents) based on a sorted copy of df_countries
# Sort order: Continent and Country, both Descending
df_continents = df_countries.sort_values(['continent', 'country'], inplace=False, ascending=[False, False])
df_continents.head(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
188,Venezuela,333,100,3,7.7,South America
185,Uruguay,115,35,220,6.6,South America
163,Suriname,128,178,7,5.6,South America
133,Peru,163,160,21,6.1,South America
132,Paraguay,213,117,74,7.3,South America
72,Guyana,93,302,1,7.1,South America
52,Ecuador,162,74,3,4.2,South America
37,Colombia,159,76,3,4.2,South America
35,Chile,130,124,172,7.6,South America
23,Brazil,245,145,16,7.2,South America
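
The sort direction can be set per column by passing a list to ascending, one entry for each sort column. A minimal sketch on a made-up dataframe `toy` (not the drinks file), which also shows that sort_values returns a new dataframe unless inplace=True is given:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"continent": ["Europe", "Africa", "Europe", "Africa"],
     "country": ["Albania", "Zambia", "Andorra", "Angola"]}
)

# continent ascending (A to Z), then country descending (Z to A) within each continent
ordered = toy.sort_values(["continent", "country"], ascending=[True, False])

# toy keeps its original order; ignore_index=True renumbers the sorted result from 0
renumbered = toy.sort_values("country", ignore_index=True)

print(list(ordered["country"]))
```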
