# Module 3.2 - Pandas Basics

This Jupyter Notebook contains instructions and practice assignments for Module 3.2, Pandas Basics.

Read the text and instructions below carefully. The assignments for you are numbered.

Good luck!

## DataFrame basics

Apart from lists, tuples and arrays, there are also dataframes: tables with rows and named columns. Let's practice working with dataframes. For this we use the package `pandas`.

To use `pandas` we first need to import it:

In [1]:
import pandas as pd

### 1. Creating a dataframe

A pandas dataframe can be created from a dictionary, like this: `DF_example = pd.DataFrame(DICTIONARY_TO_USE)`. In `DICTIONARY_TO_USE`, the keys become the column names, and the values are the collections of data (for example lists) that fill those columns.

Here is an example: 

In [2]:
DF_example = pd.DataFrame({'Numbers':[10, 20, 30, 40, 50], 'Letters':['A', 'B', 'C', 'D', 'E'], 'Words':['Car', 'House', 'Tree', 'Road', 'Sky']})
DF_example

| | Numbers | Letters | Words |
|---|---|---|---|
| 0 | 10 | A | Car |
| 1 | 20 | B | House |
| 2 | 30 | C | Tree |
| 3 | 40 | D | Road |
| 4 | 50 | E | Sky |
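
To check how a dictionary maps onto a dataframe, a minimal sketch like the one below may help. The name `fruit_df` and its contents are made up for illustration and are not part of the assignments.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8]})

# The dictionary keys have become the column names
column_names = list(fruit_df.columns)
print(column_names)

# .shape gives (number of rows, number of columns)
print(fruit_df.shape)
```

Every list in the dictionary has the same length here, so each one fills exactly one column.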


1. Now try to create your own dataframe, using the dictionary `people_dict`.

In [None]:
people_dict = {'Name':['John', 'Jane', 'Mark', 'Anna', 'Paul'], 'Weight':[45, 67, 77, 89, 102], 'Height':[1.23, 2.01, 1.88, 1.79, 1.56], 'Age':[13, 21, 76, 65, 44]}
people_dict

In [None]:
#1 Create your own dataframe
myDF = 

### 2. Select data from a dataframe

The next step is to select data from your dataframe. Every dataframe has *rows* and *columns*. For `myDF` there are five rows with index 0 to 4, and four columns named `Name`, `Weight`, `Height` and `Age`. There are many ways in which you can select data. Here are some methods you can use:
* `[]`               
    * Can be used for single *columns*
* `.get()`
    * Can be used for multiple *columns*
* `.loc[]`
    * To select *rows* and *columns* using the labels/names
* `.iloc[]`
    * To select *rows* and *columns* using the relative position/indexes
    
So, `myDF[COLUMN_NAME]` can be used to select a single column. Here is an example:



In [None]:
myDF['Name']

2. Now try to select the column `Weight`.

In [None]:
#2

To select multiple columns, you can use `.get()`. You should use a list of column names as input, like this: `myDF.get(LIST_WITH_COLUMN_NAMES)`.

Here is an example:

In [None]:
myDF.get(['Name', 'Weight'])
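
As a side note, the sketch below compares `.get()` with double square brackets on a made-up dataframe (`fruit_df` is a hypothetical name, not part of the assignments). It also shows what `.get()` does when asked for a column that does not exist.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8],
                         'Stock': [10, 0, 25]})

# A list of column names gives back a smaller dataframe
subset = fruit_df.get(['Fruit', 'Stock'])

# Double square brackets select the same columns
same_subset = fruit_df[['Fruit', 'Stock']]
print(subset.equals(same_subset))

# For a column name that does not exist, .get() returns None
# instead of raising an error
missing = fruit_df.get('Colour')
print(missing)
```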

3. Now use the `.get()` method to select the columns `Name`, `Height` and `Age`.

In [None]:
#3

Instead of selecting columns, you can also select rows. This can be done by using `.loc[]` with the label of the row. Here is an example of getting the first row (label 0):

In [None]:
myDF.loc[0]

4. Now get the second row using `.loc[]`.
5. Get the fifth row.

In [None]:
#4

In [None]:
#5

`.loc[]` can also be used to select a row *and* a column. For example, to select only the *`Weight` column* of the *third row*:

In [None]:
myDF.loc[2, 'Weight']

6. Get the second value from the `Name` column.
7. From the `Age` column, get the first value.

In [None]:
#6 

In [None]:
#7

With `.loc[]` you can also select *multiple* rows and *multiple* columns. The row labels and the column labels each have to be given as a `list`. For example, `myDF.loc[1, 2, 3, 'Weight', 'Height']` does not work, but `myDF.loc[[1, 2, 3], ['Weight', 'Height']]` does work.
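
The same rule can be tried out on a made-up dataframe. In the sketch below, `fruit_df` is a hypothetical name and not part of the assignments. The general `except Exception` is used only to show which error pandas raises.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8],
                         'Stock': [10, 0, 25]})

# One list for the row labels, one list for the column labels
selection = fruit_df.loc[[0, 2], ['Fruit', 'Price']]
print(selection)

# Without the lists, pandas sees four separate indexers and raises an error
try:
    fruit_df.loc[0, 2, 'Fruit', 'Price']
    failed = False
except Exception as error:
    failed = True
    print(type(error).__name__)
```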

8. Correct the code below.

In [None]:
myDF.loc[1, 2, 3, 'Weight', 'Height']

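You can also use *slicing* to select rows with `.loc[]`. Unlike normal Python slicing, a `.loc[]` slice includes the end label, so `2:4` selects the rows labelled 2, 3 and 4: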

In [None]:
myDF.loc[2:4, ['Name', 'Age']]
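
To see how label slicing treats the end of the slice, here is a minimal sketch on a made-up dataframe (`fruit_df` is hypothetical), next to an ordinary Python list for comparison.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8]})

# With .loc[], the slice 1:2 includes both label 1 and label 2
rows = fruit_df.loc[1:2, ['Fruit', 'Price']]
print(list(rows.index))

# With a normal Python list, the end of the slice is NOT included
plain_list = [10, 20, 30]
print(plain_list[1:2])
```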

9. Select the first and last row of columns `Name`, `Height` and `Age`. 
10. Select the last three rows of columns `Height` and `Age`, using slicing. 

In [None]:
#9

In [None]:
#10

Instead of using the names (labels) of the rows and columns with `.loc[]`, you can also select rows and columns by their *position*. This is done by using `.iloc[]` (the `i` stands for integer position).

Here is an example of selecting the second row (index = 1) of the `Weight` column (second column; index = 1): 

In [None]:
myDF.iloc[1, 1]

With `.iloc[]` you can also use slicing for the columns. As with lists, the end position of a slice is not included.
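
A small sketch of position-based slicing, again on a made-up dataframe (`fruit_df` is hypothetical and not part of the assignments):

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8],
                         'Stock': [10, 0, 25]})

# Rows at positions 0 and 1, columns at positions 1 and 2
# (the end positions 2 and 3 are not included)
block = fruit_df.iloc[0:2, 1:3]
print(block)

# Negative positions count from the end: last row, first column
last_fruit = fruit_df.iloc[-1, 0]
print(last_fruit)
```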

11. Select the second row of the `Height` column.
12. Select the last row of all except the first column.
13. Select all rows except the last one for the two first columns.

In [None]:
#11

In [None]:
#12

In [None]:
#13

We now create a new dataframe called `peoplesDF`. 

In [None]:
# Creating peoplesDF
import numpy as np
np.random.seed(1)
peoples_dict = {'Name':['Machiel', 'Thomas', 'Tjerk', 'Pau', 'Wilbert', 'Corstiaan', 'Carolien', 'Annebeth'], 
                'Favorite_fruit':['Banana', 'Mango', 'Orange', 'Banana', 'Pineapple', 'Watermelon', 'Grapefruit', 'Pear'], 
                'Score':np.random.randn(8)*100, 
                'Grade1':np.random.choice(range(1, 11), size=8), 
                'Grade2':np.random.choice(range(1, 11), size=8), 
                'Grade3':np.random.choice(range(1, 11), size=8)}
peoplesDF = pd.DataFrame(peoples_dict)
peoplesDF

For every exercise, select the highlighted values using any of the methods discussed above:

14. ![image-3.png](attachment:image-3.png)

In [None]:
#14

15. ![image.png](attachment:image.png)

In [None]:
#15

16. ![image.png](attachment:image.png)

In [None]:
#16

### 3. Selecting data using booleans

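You can also use booleans to select data from a dataframe. A condition such as `peoplesDF['Grade1'] > 5` gives one `True` or `False` per row, and `.loc[]` (note the square brackets) keeps only the rows that are `True`. Plain `[]` accepts such a boolean selection as well, but in this notebook we use `.loc[]`.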

Here are some examples:

In [None]:
# Getting all grades higher than 5 for the column Grade1
peoplesDF['Grade1'].loc[peoplesDF['Grade1'] > 5]

In [None]:
# Getting only grades between 2 and 8 (2 and 8 themselves excluded) for the column Grade2
# (Use the '&' symbol and put each condition in parentheses)
peoplesDF.loc[(peoplesDF['Grade2'] > 2) & (peoplesDF['Grade2'] < 8), 'Grade2']

In [None]:
# Only getting the rows for which the Score is higher than 100
peoplesDF.loc[peoplesDF['Score'] > 100]
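
It can help to look at the boolean values themselves. The sketch below uses a made-up dataframe (`fruit_df` is hypothetical and not part of the assignments). Besides `&` (and), it also shows `|` (or) and `~` (not), which combine conditions in the same way.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8],
                         'Stock': [10, 0, 25]})

# A condition gives one True/False value per row
mask = fruit_df['Price'] > 0.4
print(mask)

# .loc[] keeps the rows where the mask is True
expensive = fruit_df.loc[mask]

# '|' means "or": cheap fruit, or fruit with a large stock
cheap_or_many = fruit_df.loc[(fruit_df['Price'] < 0.4) | (fruit_df['Stock'] > 20)]

# '~' flips True and False: everything that is NOT expensive
not_expensive = fruit_df.loc[~mask]
```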

In [None]:
peoplesDF # Let's show the dataframe again to use it for the following exercises

Now try yourself by completing the following exercises:

17. Select all rows for which the `Score` is positive.
18. Select all values from column `Grade3` that are higher than or equal to 8.
19. Select all rows for which the value in column `Grade3` is even and higher than 5 (remember **&** and **( )**).

In [None]:
#17

In [None]:
#18

In [None]:
#19

### 4. Renaming and changing a dataframe

Dataframe columns can be renamed using the method `YOUR_DATAFRAME.rename(columns={OLD_NAME_1:NEW_NAME_1, OLD_NAME_2:NEW_NAME_2})`.

Here is an example of changing the column name `Score` to `Highest_score` in the `peoplesDF` (note that you have to redefine `peoplesDF`, using `peoplesDF = ...`, to make the change permanent):

In [None]:
peoplesDF = peoplesDF.rename(columns={'Score':'Highest_score'})
peoplesDF
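
The sketch below illustrates why the reassignment is needed: `.rename()` gives back a new dataframe and leaves the original as it was. The names `fruit_df`, `renamed` and `Price_eur` are made up for illustration.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8]})

# .rename() returns a new dataframe with the new column name
renamed = fruit_df.rename(columns={'Price': 'Price_eur'})
print(list(renamed.columns))

# The original dataframe still has the old column name
print(list(fruit_df.columns))
```

Columns that are not mentioned in the dictionary (here `Fruit`) keep their name.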

20. Now change the column names `Grade1`, `Grade2` and `Grade3` into `GradeA`, `GradeB` and `GradeC`. 

In [None]:
#20

Columns can also be dropped by using the `.drop()` method. You need to provide one or more column names using the parameter `columns=`. You can use a single column name (`columns='example_name'`) or multiple column names in a list (`columns=['example_name1', 'example_name2']`).

For example, if we want to drop the column `Favorite_fruit` we can use the following code:

In [None]:
peoplesDF = peoplesDF.drop(columns='Favorite_fruit')
peoplesDF
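
A minimal sketch of `.drop()` with a list of column names, on a made-up dataframe (`fruit_df` and `smaller` are hypothetical names). It also shows that asking to drop a column that does not exist raises a `KeyError`.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8],
                         'Stock': [10, 0, 25]})

# Drop two columns at once by passing a list
smaller = fruit_df.drop(columns=['Price', 'Stock'])
print(list(smaller.columns))

# Like .rename(), .drop() returns a new dataframe;
# the original keeps all of its columns
print(list(fruit_df.columns))

# Dropping a column that does not exist raises a KeyError
try:
    fruit_df.drop(columns='Colour')
    drop_failed = False
except KeyError:
    drop_failed = True
print(drop_failed)
```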

21. Drop the columns `GradeA`, `GradeB` and `GradeC`.

In [None]:
#21

You can also add new columns to the dataframe. This is done in the same way as adding a key:value pair to a dictionary. With this syntax you add one column at a time, and the new column needs one value for every row.

For example, let's add the column `Test` to `peoplesDF`:

In [None]:
peoplesDF['Test'] = [100, 200, 300, 400, 500, 600, 700, 800]
peoplesDF
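
The sketch below adds a column to a made-up dataframe (`fruit_df` is hypothetical and not part of the assignments) and shows what happens when the list is too short.

```python
import pandas as pd

# Hypothetical example data (not part of the assignments)
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Kiwi', 'Plum'],
                         'Price': [0.5, 0.3, 0.8]})

# Add a new column, just like adding a key to a dictionary
fruit_df['Stock'] = [10, 0, 25]
print(fruit_df)

# A list with too few values does not fit the three rows,
# so pandas raises a ValueError and the column is not added
try:
    fruit_df['Origin'] = ['Spain', 'Italy']
    add_failed = False
except ValueError:
    add_failed = True
print(add_failed)
```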

22. Now add the columns `GradeD`, `GradeE` and `GradeF` to `peoplesDF` (the values are provided in the lists below). 

In [None]:
#22
GradeD = [3, 2, 7, 9, 6, 8, 10, 9]
GradeE = [8, 8, 7, 8, 6, 7, 8, 8]
GradeF = [10, 9, 10, 10, 10, 8, 9, 9]
