# Day 5 Session A - Live Coding

[session link](https://eds-217-essential-python.github.io/course-materials/live-coding/5a_selecting_and_filtering.html)

## Basic Pandas Selection and Filtering
At the highest level, filtering acts on rows and selection acts on columns.

A more general description: filtering picks data using criteria that relate to the values in the df. Selection picks data by label or position and does not depend on the values.
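
To make the distinction concrete, here is a minimal sketch on a made-up three-row DataFrame. The name `toy` and its values are assumptions for illustration, not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "student_id": [1, 2, 3],
    "gpa": [3.9, 2.5, 3.8],
})

# Selection: choose columns by name; the values are never inspected
selected = toy[["student_id"]]

# Filtering: keep only the rows whose values satisfy a condition
filtered = toy[toy["gpa"] > 3.7]

print(selected.shape)  # (3, 1): all rows, one column
print(filtered.shape)  # (2, 2): two rows, all columns
```

Selection changed the number of columns and filtering changed the number of rows.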

## 1. Setup


In [1]:
import pandas as pd

# data
url = 'https://bit.ly/eds217-studentdata'

df = pd.read_csv(url)

In [2]:
df.head()

Unnamed: 0,student_id,age,gpa,major
0,1000,24,2.18,Mathematics
1,1001,21,2.39,Physics
2,1002,22,2.09,Physics
3,1003,24,2.65,Computer Science
4,1004,20,2.78,Chemistry


## 2. Basic Selection


In [8]:
# Select a single column and assign it to a Series

majors = df['major']  # to get a DataFrame instead, use df[['major']]
type(majors)

# Select multiple columns from a df and assign them to a new df
# by passing a list of column names inside the selection brackets
id_majors = df[['student_id', 'major']]
type(id_majors)

pandas.core.frame.DataFrame

## 3. Filtering Based on Column Values

### 3a. Single Condition Filtering


In [10]:
# Filter rows on a single condition
# Keep only students with gpa > 3.7
high_achievers = df[df['gpa'] > 3.7]
type(high_achievers)

pandas.core.frame.DataFrame

In [12]:
valid = df['gpa'] > 3.7
print(valid)  # a boolean Series (mask) with one True/False value per row
# The mask can then be passed into the selection brackets: df[valid]

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98     True
99    False
Name: gpa, Length: 100, dtype: bool
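
A short sketch of how a boolean mask can be stored and reused, again on assumed toy data (`toy`, `mask`, `n_high`, and `high` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data standing in for the course dataset
toy = pd.DataFrame({"gpa": [3.9, 2.5, 3.8, 3.1]})

mask = toy["gpa"] > 3.7    # boolean Series aligned to toy's index
n_high = int(mask.sum())   # True counts as 1, so sum() counts the matches
high = toy[mask]           # same rows as toy[toy["gpa"] > 3.7]

print(n_high)              # 2
print(high.index.tolist()) # [0, 2]
```

Naming the mask is handy when the same condition is needed more than once, for example to count matches before filtering.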


### 3b. Multiple Conditions with Logical Operators


In [13]:
# Filter on multiple conditions, which usually (but not always) involve different columns
# Wrap each condition in parentheses and combine them with & (and) or | (or)

young_math = df[(df['age'] < 20) & (df['major'] == 'Mathematics')]

# Find students who are either 22 years old or have a GPA of 3.5

specific_students = df[(df['age'] == 22) | (df['gpa'] == 3.5)]
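
The sketch below shows the element-wise operators `&` and `|`, plus `~` for negation, on assumed toy data. Giving each condition a name is one way to avoid the precedence problems that the parentheses otherwise guard against.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "age": [19, 22, 18, 25],
    "major": ["Mathematics", "Physics", "Physics", "Mathematics"],
})

is_young = toy["age"] < 20
is_math = toy["major"] == "Mathematics"

young_and_math = toy[is_young & is_math]  # both conditions true
young_or_math = toy[is_young | is_math]   # at least one condition true
not_math = toy[~is_math]                  # condition negated

print(young_and_math.index.tolist())  # [0]
print(young_or_math.index.tolist())   # [0, 2, 3]
print(not_math.index.tolist())        # [1, 2]
```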

### 3c. Using the filter command

Use the `filter` method to match specific columns or rows based on their labels (column names, index labels), not on their values.

By default, `filter` operates on columns.

Use the `like` argument to match labels that contain a substring (especially useful for a large df with many columns).

In [19]:
# filter all the columns that contain the substring "id"
id_columns = df.filter(like='id')

# Keep all the rows where the index label contains a '5'
# axis=0 is the row axis; axis=1 is the column axis
rows_with_5 = df.filter(like='5', axis=0)
print(rows_with_5)

    student_id  age   gpa             major
5         1005   22  2.54       Engineering
15        1015   20  2.40         Chemistry
25        1025   18  3.25           Physics
35        1035   24  3.43  Computer Science
45        1045   21  3.27  Computer Science
50        1050   22  2.82       Engineering
51        1051   20  3.51       Engineering
52        1052   24  2.46           Physics
53        1053   22  2.15  Computer Science
54        1054   18  2.58  Computer Science
55        1055   24  2.32           Physics
56        1056   19  3.86           Physics
57        1057   21  3.62           Physics
58        1058   18  3.27       Mathematics
59        1059   21  3.74         Chemistry
65        1065   22  3.79         Chemistry
75        1075   20  2.44         Chemistry
85        1085   19  2.50  Computer Science
95        1095   23  2.48         Chemistry


In [21]:
# filter column names using a regex instead of like
# Find all the columns that end in the letter 'e'
e_ending_columns = df.filter(regex='e$')
print(e_ending_columns)

    age
0    24
1    21
2    22
3    24
4    20
..  ...
95   23
96   21
97   23
98   24
99   24

[100 rows x 1 columns]
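
A compact sketch of the three calls above on assumed toy data. The column names mirror the course dataset, but the values, the index labels, and the combined pattern `^s|e$` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with a custom integer index
toy = pd.DataFrame(
    {
        "student_id": [1000, 1001],
        "age": [24, 21],
        "gpa": [2.18, 2.39],
        "major": ["Mathematics", "Physics"],
    },
    index=[5, 16],
)

by_substring = toy.filter(like="id")          # column labels containing "id"
by_regex = toy.filter(regex="^s|e$")          # labels starting with "s" or ending in "e"
by_row_label = toy.filter(like="5", axis=0)   # index labels containing "5"

print(by_substring.columns.tolist())  # ['student_id']
print(by_regex.columns.tolist())      # ['student_id', 'age']
print(by_row_label.index.tolist())    # [5]
```

In every case `filter` looks only at labels; the cell values play no role.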


RegexLearn:
https://regexlearn.com/learn/regex101

## 4. Combining Selection and Filtering

Use method chaining to append a selection to the results of a filter before assigning the result to a new variable.

In [29]:
# Get a list of majors for students under 21
young_majors = df[df['age'] < 21]['major'].to_list()
# The filter returns a DataFrame; ['major'] then selects one column as a Series,
# and to_list() converts that Series to a plain Python list
print(young_majors)

['Chemistry', 'Chemistry', 'Chemistry', 'Computer Science', 'Computer Science', 'Chemistry', 'Mathematics', 'Physics', 'Physics', 'Mathematics', 'Computer Science', 'Physics', 'Mathematics', 'Computer Science', 'Biology', 'Mathematics', 'Chemistry', 'Mathematics', 'Engineering', 'Computer Science', 'Physics', 'Mathematics', 'Mathematics', 'Physics', 'Computer Science', 'Mathematics', 'Mathematics', 'Chemistry', 'Biology', 'Physics', 'Computer Science', 'Computer Science', 'Computer Science', 'Computer Science', 'Physics', 'Physics']
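
To see what each link in the chain returns, the sketch below splits it into named steps on assumed toy data (`step1` to `step3` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "age": [19, 22, 20],
    "major": ["Physics", "Biology", "Chemistry"],
})

step1 = toy[toy["age"] < 21]  # filter rows: DataFrame
step2 = step1["major"]        # select one column: Series
step3 = step2.to_list()       # convert to a plain Python list

print(type(step1).__name__, type(step2).__name__, type(step3).__name__)
# DataFrame Series list
print(step3)  # ['Physics', 'Chemistry']
```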


## 5. Using .isin() for Multiple Values

`.isin()` is useful for filtering rows that match any value in a list. For example, filtering by a subset of majors.

Here it is applied to a Series (a single column), where it returns one True/False value per row.

The selection brackets accept only a few kinds of input: a string (one column name), a list of column names, or a list of True/False values. The True/False list needs to be the same length as the df.
```python
df['string'], df[['list', 'of', 'columns']], df[[True, False, False, ...]]
```

Most useful for filtering categorical data.

In [31]:
stem_majors = df[ df['major'].isin(['Engineering', 'Chemistry', 'Physics']) ]
stem_majors.head()

Unnamed: 0,student_id,age,gpa,major
1,1001,21,2.39,Physics
2,1002,22,2.09,Physics
4,1004,20,2.78,Chemistry
5,1005,22,2.54,Engineering
8,1008,19,2.56,Chemistry
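
A small sketch of `.isin()` together with its negation, on assumed toy data (`wanted`, `in_list`, and `not_in_list` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({"major": ["Physics", "Biology", "Chemistry", "Physics"]})

wanted = ["Physics", "Chemistry"]
mask = toy["major"].isin(wanted)  # True where the value is in the list

in_list = toy[mask]
not_in_list = toy[~mask]          # ~ flips the mask to exclude the listed values

print(mask.tolist())                    # [True, False, True, True]
print(not_in_list["major"].tolist())    # ['Biology']
```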


## 6. Filtering with String Methods

Pandas provides string methods that can be used to filter text data.

The question to ask: what can we put in our conditional that will return a Series of True/False values?

In [33]:
# Filter majors that contain the word 'Science'
science_majors = df[ df['major'].str.contains('Science')]
# .str exposes the string methods; contains('Science') is True where the
# substring is found (the match is case-sensitive by default)
print(science_majors.head())

    student_id  age   gpa             major
3         1003   24  2.65  Computer Science
11        1011   20  3.60  Computer Science
12        1012   20  2.15  Computer Science
28        1028   23  2.62  Computer Science
29        1029   22  2.65  Computer Science
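
A hedged sketch of two `str.contains` options that often matter in practice: `case=False` for case-insensitive matching and `na=False` so missing values count as non-matches. The toy data, including the lowercase and missing entries, is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data with mixed case and a missing value
toy = pd.DataFrame({"major": ["Computer Science", "data science", None, "Physics"]})

# Case-sensitive match; missing values are treated as False
strict = toy["major"].str.contains("Science", na=False)

# Case-insensitive match; missing values are treated as False
loose = toy["major"].str.contains("science", case=False, na=False)

print(strict.tolist())  # [True, False, False, False]
print(loose.tolist())   # [True, True, False, False]
print(toy[loose]["major"].tolist())  # ['Computer Science', 'data science']
```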


## 7. Advanced Selection: .loc vs .iloc


In [None]:
df['string'], df[['list', 'of', 'columns']], df[[True, False, False, ...]]
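
A minimal sketch of the distinction this heading names, on assumed toy data: `.loc` selects by label and `.iloc` selects by integer position. The index labels 10 to 12 are made up so that labels and positions differ.

```python
import pandas as pd

# Hypothetical toy data; the index labels deliberately differ from positions
toy = pd.DataFrame(
    {"age": [24, 21, 22], "major": ["Mathematics", "Physics", "Physics"]},
    index=[10, 11, 12],
)

by_label = toy.loc[10, "major"]  # row labelled 10, column "major"
by_position = toy.iloc[0, 1]     # first row, second column

label_slice = toy.loc[10:11]     # label slices include the end label
position_slice = toy.iloc[0:1]   # position slices exclude the end position

# .loc also accepts a boolean mask for rows plus a column label,
# so a filter and a selection can be written in one step
older_majors = toy.loc[toy["age"] > 21, "major"]

print(by_label, by_position)                   # Mathematics Mathematics
print(len(label_slice), len(position_slice))   # 2 1
print(older_majors.tolist())                   # ['Mathematics', 'Physics']
```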

## Conclusion
