### Data Filtering:
In general, filtering means removing irrelevant (and possibly harmful) material from something, as in water filtering or air filtering.

Similarly, in data science, filtering means refining a dataset down to just what a user needs, leaving out data that is irrelevant or sensitive. Different strategies can be used to filter data, depending on the user's needs.

A real-world example of filtering is preventing access to sensitive information. For example, a data filter may remove Social Security numbers or credit card numbers from client data before an employee starts working with it.


| UserName | credit amount | Gender | Billing Date | CreditCard | SSN    |  
| ---------| -----------   | ------ | ------------ |----------- |------- | 
| Mason    |  5000         |    M   |  18th        | 3827383728 | 3242742 |  
| Andrew   |  3.1          |    M   |  22nd        | 7492748833 | 9873892 |  
| Trevor   |  3.7          |    M   |  11th        | 2618193385 | 7847284 | 
| Anthony  |  3.33         |    M   |  13th        | 8901284722 | 5729175 |  
| Robinson |  3.45         |    M   |  18th        | 4729108379 | 4110394 |  
| Alice    |  3.18         |    F   |  20th        | 7191039091 | 8918383 |  
| Katie    |  3.25         |    F   |  7th         | 4739378748 | 1129483 |  

##### After filtering, the data sent to the employee looks like this:

| UserName | credit amount | Gender | Billing Date |   
| ---------| -----------   | ------ | ------------ |  
| Mason    |  5000         |    M   |  18th        | 
| Andrew   |  3.1          |    M   |  22nd        | 
| Trevor   |  3.7          |    M   |  11th        | 
| Anthony  |  3.33         |    M   |  13th        | 
| Robinson |  3.45         |    M   |  18th        | 
| Alice    |  3.18         |    F   |  20th        |
| Katie    |  3.25         |    F   |  7th         |
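
A minimal sketch of this kind of column filter in pandas, assuming the client data sits in a DataFrame. Only the first two rows of the table are reproduced, and the variable names `clients` and `shareable` are illustrative:

```python
import pandas as pd

# First two rows of the client table above (card and SSN values stored as strings).
clients = pd.DataFrame({
    "UserName": ["Mason", "Andrew"],
    "credit amount": [5000, 3.1],
    "Gender": ["M", "M"],
    "Billing Date": ["18th", "22nd"],
    "CreditCard": ["3827383728", "7492748833"],
    "SSN": ["3242742", "9873892"],
})

# Drop the sensitive columns before sharing the data.
# drop() returns a new DataFrame; the original `clients` is left unchanged.
shareable = clients.drop(columns=["CreditCard", "SSN"])
print(shareable)
```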

### The need for filtering data in a dataset:

Data scientists often have to work with huge datasets, but not all of the data matters for a specific analysis. In those cases filtering is very helpful: it lets you focus on the relevant data instead of looking at the whole dataset every time.

Suppose you have a population dataset that stores, for each person, **age, height, weight, hair_color, eye_color, and BMI**. You are asked to find out how a person's height changes with age.

For this scenario, you don't need any information other than the **height** and **age** columns. So we can filter the dataset down to just the **height** and **age** columns and do further analysis on those.

This is how filtering helps in analysis of data and allows us to focus only on the required information.
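
As a rough sketch of that column selection in pandas (the records below are made up for illustration, and the names `population` and `age_height` are assumptions, not part of any real dataset):

```python
import pandas as pd

# Made-up population records, for illustration only.
population = pd.DataFrame({
    "age": [5, 12, 18, 30],
    "height": [110, 150, 172, 175],
    "weight": [18, 40, 65, 72],
    "hair_color": ["brown", "black", "blond", "black"],
    "eye_color": ["brown", "brown", "blue", "green"],
    "BMI": [14.9, 17.8, 22.0, 23.5],
})

# Indexing with a list of column names keeps just those columns.
age_height = population[["age", "height"]]
print(age_height)
```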

### Discussion Activity

Suppose we have some data on GPA, GRE scores, and gender. We want to compare students' performance by gender (male/female).

| GRE         | GPA         | Gender |
| ----------- | ----------- | ------ |
| 316         |  3.4        |    M   |
| 308         |  3.1        |    M   |
| 327         |  3.7        |    F   |
| 310         |  3.33       |    F   |
| 305         |  3.45       |    M   |
| 322         |  3.18       |    F   |
| 316         |  3.25       |    M   |
| 300         |  3.4        |    F   |
| 310         |  3.6        |    F   |


#### Let's say we want to find the following information:
- What is the average GRE score in each gender group? Which group has the better average score?
- What is the average GPA in each gender group? Which group has the better average GPA?
- What is the maximum GPA among the males?
- What is the minimum GRE score among the females?

How can we find this information from the given data?

### Extended example

Suppose we have a dataset, named `mpg.csv`, that contains information about cars. First, we load the dataset into a dataframe and look at its contents.


<details>
<summary>Blockly instructions: (click triangle to expand)</summary>
<br>    
    
- Using the IMPORT menu in the Blockly palette, click on an import block `import some library as variable name`.
- After you click it, the block drops into the Blockly workspace. Change `some library` to `pandas` by typing into that box.
- Click on the `variable name` dropdown, choose `Rename variable...`, and type `pd` into the box that pops up.
- Click on "Blocks to Code" at the bottom of the Blockly palette.

Run the Python cell by pressing Shift+Enter.
</details>


### Import pandas library
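
The import block described in the Blockly instructions should translate to roughly this line (a sketch of the expected output, not a capture of the actual generated code):

```python
# Load the pandas library and give it the short alias `pd`.
import pandas as pd
```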

### Read CSV data into dataframe


<details>
<summary>Blockly instructions: (click triangle to expand)</summary>
<br>
    
- Go to the VARIABLES menu in the Blockly palette and click on the `with pd do ...using` block.
- After it drops into the Blockly workspace, wait a second until the dropdown stops loading, and then click on it and select `read_csv`.
- Then get a `" "` block from TEXT, drop it on the workspace, drag it to the `using` part of the first block, and type the file path `datasets/mpg.csv` into it.
        
</details>

<details>
<summary>Blockly instructions: (click triangle to expand)</summary>
<br>

- Using the VARIABLES menu in the Blockly palette, click on `Create variable...` and type `car_data` into the pop-up window.
- Then click on the `set car_data to` block.
- Connect the `set car_data to` block to the previous `with pd do...` block.
- Click on "Blocks to Code" at the bottom of the Blockly palette.
    
Run the python cell by pressing Shift+Enter.    
</details>
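
A sketch of what the two sets of blocks above amount to. So that it runs without the real file, it reads a tiny in-memory stand-in (made-up rows, using only column names mentioned in this lesson); in the notebook the argument would be the path `datasets/mpg.csv`:

```python
import io
import pandas as pd

# Stand-in for datasets/mpg.csv: made-up rows, for illustration only.
sample_csv = io.StringIO(
    "mpg,cylinders,acceleration,model_year,name\n"
    "18.0,8,12.0,70,example sedan\n"
    "31.0,4,16.5,81,toyota corolla\n"
)

# In the notebook this would be: car_data = pd.read_csv("datasets/mpg.csv")
car_data = pd.read_csv(sample_csv)

# print() shows the dataframe; in a notebook, a cell ending in `car_data` does the same.
print(car_data)
```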

### View the dataframe

<details>
<summary>Blockly instructions: (click triangle to expand)</summary>
<br>
    
- Click the variable block `car_data` in the Blockly workspace.
- Click on "Blocks to Code" at the bottom of the Blockly palette.

Run the Python cell by pressing Shift+Enter.
</details>


### Look at the features in the dataset:

<details>
<summary> Blockly Instructions(click triangle to expand) </summary>
<br>
    
- From the VARIABLES menu, click the `from car_data get...` block to drop it into the workspace.
- In the block, click on `get` and select the `columns` option.
- Click on "Blocks to Code" at the bottom of the blockly palette.

Run the python code cell.
</details>
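
A minimal sketch of listing the features, assuming `car_data` is a pandas dataframe. The small frame here is a made-up stand-in so the example runs on its own:

```python
import pandas as pd

# Made-up stand-in; in the notebook car_data is loaded from datasets/mpg.csv.
car_data = pd.DataFrame({
    "mpg": [18.0, 31.0],
    "cylinders": [8, 4],
    "acceleration": [12.0, 16.5],
    "model_year": [70, 81],
    "name": ["example sedan", "toyota corolla"],
})

# The `columns` attribute holds the column (feature) names.
print(list(car_data.columns))
```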



After looking at the car features in `car_data`, suppose that as a customer we only want cars with 4 cylinders (assuming cars with more cylinders are more expensive). So we filter the car data, keeping only the cars that have 4 cylinders, and store them in a new variable.



<details>
<summary> Blockly Instructions(click triangle to expand) </summary>
<br>

- From `LISTS`, click on `{dictvariable}[]` and put `car_data` in place of `dictvariable`. Get a `" "` text block, type `cylinders` into it, and drag it inside the `car_data` block.
- From `LOGIC`, click on the `equals` block. On its left side, put the `car_data` block containing `cylinders`. On its right side, put a math `123` block and change the number to `4`.
- From `LISTS`, click on `{dictvariable}[]` again and put `car_data` in place of `dictvariable`. Put the logic block inside this new `car_data` block.
- Create a variable `cylinder4`. Click on `set cylinder4 to` and connect it to the `car_data` block.
- From `VARIABLES`, click the `cylinder4` block.
- Click on "Blocks to Code" at the bottom of the blockly palette.

Run the python code cell.
</details>
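
The blocks above build a row filter that might look like the sketch below. The `car_data` frame is a made-up stand-in and `is_four_cylinder` is a helper name introduced only for illustration:

```python
import pandas as pd

# Made-up stand-in rows; in the notebook car_data comes from datasets/mpg.csv.
car_data = pd.DataFrame({
    "name": ["example wagon", "toyota corolla", "example coupe"],
    "cylinders": [8, 4, 4],
    "mpg": [15.0, 32.0, 27.5],
})

# Boolean mask: True for rows whose cylinders value equals 4.
is_four_cylinder = car_data["cylinders"] == 4

# Indexing with the mask keeps only the rows marked True.
cylinder4 = car_data[is_four_cylinder]
print(cylinder4)
```

The comparison produces one True/False value per row, and indexing the dataframe with that result is what performs the filtering.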

After getting all the 4-cylinder cars from the dataset, suppose we want to buy a 'toyota corolla', a car we are huge fans of. So, from our 4-cylinder cars, we keep only the 'toyota corolla' rows and store them in a new variable.


<details>
<summary> Blockly Instructions(click triangle to expand) </summary>
<br>

- From `LISTS`, click on `{dictvariable}[]` and put `cylinder4` in place of `dictvariable`. Get a `" "` text block, type `name` into it, and drag it inside the `cylinder4` block.
- From `LOGIC`, click on the `equals` block. On its left side, put the `cylinder4` block containing `name`. On its right side, put a `" "` block and type `toyota corolla` into it.
- From `LISTS`, click on `{dictvariable}[]` again and put `cylinder4` in place of `dictvariable`. Put the logic block inside this new `cylinder4` block.
- Create a variable `toyota_cars`. Click on `set toyota_cars to` and connect it to the `cylinder4` block.
- From `VARIABLES`, click the `toyota_cars` block.
- Click on "Blocks to Code" at the bottom of the blockly palette.

Run the python code cell.
</details>
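
A sketch of the same filter applied to a text column. The `cylinder4` frame below is a made-up stand-in for the result of the previous step:

```python
import pandas as pd

# Made-up stand-in for cylinder4 (the 4-cylinder rows from the previous step).
cylinder4 = pd.DataFrame({
    "name": ["toyota corolla", "example hatchback", "toyota corolla"],
    "cylinders": [4, 4, 4],
    "model_year": [76, 79, 81],
})

# Keep only the rows whose name is exactly "toyota corolla".
toyota_cars = cylinder4[cylinder4["name"] == "toyota corolla"]
print(toyota_cars)
```

Note that `==` is an exact match, so a row whose name merely contains "toyota corolla" plus extra text would not be kept.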

A lot of information is shown about each car (mpg, cylinders, etc.), but we are interested in only a few of the variables: name, mpg, acceleration, and model_year. So next we keep only those columns.


<details>
<summary> Blockly Instructions(click triangle to expand) </summary>
<br>


- From `LISTS`, click on the `{dictVariable} [ ]` block and put `toyota_cars` in place of `{dictVariable}`.
- From `LISTS`, click on the `create list with` block. Click the + sign on the list block 4 times. Take 4 `" "` blocks from `TEXT`, type `name`, `mpg`, `model_year`, and `acceleration` into them (one per block), and drag them inside the `create list with` block.
- Drag the `create list with` block inside the `toyota_cars` block.
- Create a variable `toyota_info`. Click on `set toyota_info to` and connect it to the `toyota_cars` block.
- From `VARIABLES`, click the `toyota_info` block.
- Click on "Blocks to Code" at the bottom of the blockly palette.

Run the python code cell.
</details>
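
Selecting columns works like the row filters, except that a list of column names goes inside the brackets. A sketch, with a made-up stand-in for `toyota_cars`:

```python
import pandas as pd

# Made-up stand-in for toyota_cars (the rows kept in the previous step).
toyota_cars = pd.DataFrame({
    "mpg": [28.0, 33.0],
    "cylinders": [4, 4],
    "acceleration": [16.4, 16.8],
    "model_year": [76, 81],
    "name": ["toyota corolla", "toyota corolla"],
})

# A list of column names keeps those columns, in the order listed.
toyota_info = toyota_cars[["name", "mpg", "model_year", "acceleration"]]
print(toyota_info)
```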

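Now, suppose we have heard that cars made after 1980 are the most fuel efficient. From our newly filtered data (`toyota_info`), we keep only the cars whose `model_year` is greater than 80 (the comparison uses `80` rather than `1980`, which suggests the dataset stores years as two digits).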



<details>
<summary> Blockly Instructions(click triangle to expand) </summary>
<br>

- From `LISTS`, click on `{dictvariable}[]` and put `toyota_info` in place of `dictvariable`. Get a `" "` text block, type `model_year` into it, and drag it inside the `toyota_info` block.
- From `LOGIC`, click on the `equals` block. On its left side, put the `toyota_info` block containing `model_year`. On its right side, put a math `123` block and change the number to `80`. Click on the middle of the logic block and select the `>` sign.
- From `LISTS`, click on `{dictvariable}[]` again and put `toyota_info` in place of `dictvariable`. Put the logic block inside this new `toyota_info` block.
- Create a variable `good_mileage`. Click on `set good_mileage to` and connect it to the `toyota_info` block.
- From `VARIABLES`, click the `good_mileage` block.
- Click on "Blocks to Code" at the bottom of the blockly palette.

Run the python code cell.
</details>
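
A sketch of this last filter, which uses `>` instead of `==`. The `toyota_info` frame is a made-up stand-in, and it assumes `model_year` holds two-digit years as the value `80` in the steps above implies:

```python
import pandas as pd

# Made-up stand-in for toyota_info (two-digit model years assumed).
toyota_info = pd.DataFrame({
    "name": ["toyota corolla", "toyota corolla", "toyota corolla"],
    "mpg": [28.0, 32.0, 34.0],
    "model_year": [76, 81, 82],
    "acceleration": [16.4, 16.8, 16.9],
})

# Keep only rows with model_year strictly greater than 80 (a car from 80 itself is excluded).
good_mileage = toyota_info[toyota_info["model_year"] > 80]
print(good_mileage)
```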

Finally, after filtering by our requirements to buy our preferred kind of car, we have a list of only 2 cars with the required features from an initial list of almost 400 cars. By using filtering in a big dataset, we can more easily find the data we care about.