<a href="https://colab.research.google.com/github/guyfrancis/dat1001/blob/main/Pandas_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas (Part 1)

Pandas is a Python library that provides a data structure called a **dataframe**, along with functions and methods that help manipulate and analyze data.
Dataframes are very useful for storing and manipulating data in tabular form, with rows and columns.

In this notebook, we will demonstrate various features for manipulating data using pandas dataframes.

Let's start by importing the pandas library. (We will also import numpy.)
We will then create a dictionary containing information about certain people and we will use pandas to convert it to a dataframe.

In [1]:
import pandas as pd
import numpy as np

# Create a dictionary with information about some people
my_dict = { "Name": ["Bob", "Charlie", "Elise", "Darius"], "Age": [32, 47, 25, 19], "Height (ft)": [5.9, 6.0, 5.2, 6.2]}

# Convert the dictionary into a pandas dataframe object
df = pd.DataFrame(my_dict)

# the head() method shows the column headers and the first few rows of the data
# In this case, as there are only a few rows, it will show all the rows
df.head()

|   | Name | Age | Height (ft) |
|---|---|---|---|
| 0 | Bob | 32 | 5.9 |
| 1 | Charlie | 47 | 6.0 |
| 2 | Elise | 25 | 5.2 |
| 3 | Darius | 19 | 6.2 |


## Dataframe operations

Now that the data is stored in a dataframe, we can analyze it and perform operations on it.

First, the describe() method returns summary statistics for the numeric columns of the data.

In [2]:
# The describe method gives some summary information about the data
df.describe()

|   | Age | Height (ft) |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 30.75 | 5.825 |
| std | 12.065792 | 0.434933 |
| min | 19.0 | 5.2 |
| 25% | 23.5 | 5.725 |
| 50% | 28.5 | 5.95 |
| 75% | 35.75 | 6.05 |
| max | 47.0 | 6.2 |
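
Each row of the describe() summary can also be computed on its own. The short sketch below rebuilds the same example dataframe so that it runs independently, then calls a few of the corresponding methods on the 'Age' column. The variable names `mean_age`, `median_age` and `oldest` are only illustrative.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

# Individual statistics that describe() reports for a numeric column
mean_age = df["Age"].mean()      # the 'mean' row
median_age = df["Age"].median()  # the '50%' row
oldest = df["Age"].max()         # the 'max' row

print(mean_age, median_age, oldest)
```

By default describe() only summarizes numeric columns, which is why 'Name' does not appear in its output.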


The **shape** attribute returns the dimensions of the dataframe: the number of rows and the number of columns.

It is a **tuple** whose first value is the number of rows and whose second value is the number of columns.

In [None]:
# Get the shape of the data in the format (rows, cols)
df.shape

You can use indexing to get just the number of rows or just the number of columns from the shape tuple.

In [None]:
# Use indexing to print number of rows in the dataframe
print("Number of rows:", df.shape[0])
# Use indexing to print number of columns in the dataframe
print("Number of columns:", df.shape[1])
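
Because shape is an ordinary tuple, it can also be unpacked into two variables in one step. The sketch below assumes the same example data as above; `n_rows` and `n_cols` are names chosen here for illustration.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

# Unpack the (rows, cols) tuple into two separate variables
n_rows, n_cols = df.shape
print("Rows:", n_rows, "Columns:", n_cols)

# len() gives the same counts as the two entries of shape
print(len(df) == n_rows)          # number of rows
print(len(df.columns) == n_cols)  # number of columns
```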

If you want a whole column, you can use the column title as a reference.

In [None]:
# Get the column 'Age'
df['Age']
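
As a small illustration, selecting with a single column title gives back a pandas Series, while passing a list of titles gives back a smaller dataframe. The sketch below assumes the same example data; `ages` and `subset` are illustrative names.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

# A single column title returns a Series (one column of values)
ages = df["Age"]
print(type(ages).__name__)

# A list of column titles returns a dataframe with just those columns
subset = df[["Name", "Age"]]
print(type(subset).__name__)
print(subset.shape)
```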

## Using **iloc**

You can reference a particular table value with the **iloc** indexer, giving the integer position of the row and column. Think of **iloc** as being short for integer location. Note that **iloc** uses square brackets, not parentheses.

The format is:
```
df.iloc[row_value, column_value]
```
Remember that indexing starts at 0 for both rows and columns.


In [None]:
# Get the entry in first row (row 0) and second column (column 1) of the data frame
df.iloc[0, 1]

In [None]:
# Get the entry in the second row and third column of the data frame
df.iloc[1, 2]

You can use **iloc** to get a particular row.

In [None]:
# Get row 1 (second row) of the dataframe
df.iloc[1]

You can use **iloc** to get a 'slice' of the dataframe.

In [None]:
# Grab rows 2 and 3 and columns 1 and 2 from the dataframe
df.iloc[2:4, 1:3]
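
A few more slicing patterns may be useful here. As with Python lists, the end of a slice is not included, a bare ':' means "everything", and negative positions count from the end. The sketch below assumes the same example data, and the variable names are illustrative.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

# Rows 0 and 1 only: the end position (2) is excluded
first_two = df.iloc[:2]

# All rows, column 1 only (the 'Age' column)
all_ages = df.iloc[:, 1]

# Negative positions count from the end, so -1 is the last row
last_row = df.iloc[-1]

print(first_two)
print(all_ages)
print(last_row)
```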

## Exercise

1. Use the column title 'Height (ft)' to get the last column of the dataframe.
2. Use **iloc** to get Charlie's height.
3. Use **iloc** with ':' to get only the first two rows and the first two columns of the data.

Type your code in the cell below.

In [None]:
# Type your code here


In [None]:
# Solution
print(df['Height (ft)'])
print(df.iloc[1, 2])
print(df.iloc[0:2, 0:2])

## Modifying data
We can add or remove rows and columns and modify the data in particular locations. We can also make changes based on conditions.

We will do the following with this example dataframe:

1. Change the title of the 'Name' column to 'First Name'.
2. Change Bob's height to 5.8 ft.
3. Add a column called 'Over 40' and set it to 'Yes' if the person is over 40 and 'No' otherwise.
4. Create a new column called 'Height (m)' which gives everyone's height in metres.

In [None]:
# Rename 'Name' column to 'First Name'
# rename() returns a new dataframe, so we assign the result back to df
df = df.rename(columns={'Name': 'First Name'})
df
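
One detail worth checking when renaming: by default rename() returns a new dataframe and leaves the original untouched, so the result has to be stored in a variable to be kept. The sketch below demonstrates this on a fresh copy of the example data; `renamed` is an illustrative name.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

# rename() returns a new dataframe with the updated column title
renamed = df.rename(columns={"Name": "First Name"})

# The original dataframe still has the old column title
print(list(df.columns))
print(list(renamed.columns))
```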

In [None]:
# Change Bob's height to 5.8
# His height is in row 0 column 2
# We will use iloc[row, col] indexing
df.iloc[0, 2] = 5.8
df
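
Counting columns by hand can go wrong once columns are added or removed. As a hedged alternative, the sketch below looks up the column position from its title with `df.columns.get_loc()` and then uses that position with **iloc**. It rebuilds the example data so that it runs on its own; `row_pos` and `col_pos` are illustrative names.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

row_pos = 0                                  # Bob is in the first row
col_pos = df.columns.get_loc("Height (ft)")  # integer position of the column

# Update the value at that row and column position
df.iloc[row_pos, col_pos] = 5.8
print(df)
```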

In [None]:
# Create a new column called 'Over 40' with default value '-'
df['Over 40']='-'
df

In [None]:
# Next, we will update the 'Over 40' column so that it says 'Yes' if the person is over 40 and 'No' otherwise.
# Note that this uses the where() function from the numpy library
df['Over 40'] = np.where(df['Age'] > 40, 'Yes', 'No')
df
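
To see how the condition works, it can help to look at the intermediate result. Comparing a column with a number produces a column of True/False values, and np.where() then picks 'Yes' where the value is True and 'No' where it is False. The sketch below assumes the same example data; `is_over_40` and `labels` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

# Step 1: the comparison gives one True/False value per row
is_over_40 = df["Age"] > 40
print(is_over_40)

# Step 2: np.where() chooses 'Yes' for True and 'No' for False
labels = np.where(is_over_40, "Yes", "No")
print(labels)
```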

In [None]:
# Lastly, we will add a column for people's height in metres. 1 foot is equal to 0.3048 metres. We can do this in one line!
df['Height (m)'] = df['Height (ft)']*0.3048
df
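
The converted heights have more decimal places than are usually needed. As an optional variation, the sketch below applies round() to the result to keep two decimal places. It uses the original example heights, and `FT_TO_M` is a constant name introduced here for illustration.

```python
import pandas as pd

# Same example data as in the first cell of this notebook
df = pd.DataFrame({
    "Name": ["Bob", "Charlie", "Elise", "Darius"],
    "Age": [32, 47, 25, 19],
    "Height (ft)": [5.9, 6.0, 5.2, 6.2],
})

FT_TO_M = 0.3048  # 1 foot is equal to 0.3048 metres

# The multiplication applies to every row; round(2) keeps two decimal places
df["Height (m)"] = (df["Height (ft)"] * FT_TO_M).round(2)
print(df)
```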



## Exercise

In this exercise, you will create a dataframe from a dictionary that lists the top ten fourteeners in Colorado.

In this exercise, there are 8 tasks, starting with task 1 below.

1. Use pd.DataFrame() to read the dictionary below into a dataframe. Add your code to the cell below and then run it.

In [None]:
# Here is the dictionary
ftns_dict = {'Rank': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7', 7: '8', 8: '9', 9: '10'},
             'Summit Name': {0: 'Mount Elbert', 1: 'Mount Massive', 2: 'Mount Harvard', 3: 'Blanca Peak', 4: 'La Plata Peak', 5: 'Uncompahgre Peak', 6: 'Crestone Peak',
                             7: 'Mount Lincoln', 8: 'Grays Peak', 9: 'Castle Peak'},
             'Range': {0: 'Sawatch', 1: 'Sawatch', 2: 'Sawatch', 3: 'Sangre de Cristo', 4: 'Sawatch', 5: 'San Juan', 6: 'Sangre de Cristo', 7: 'Tenmile-Mosquito', 8: 'Front Range', 9: 'Elk'},
             'Elevation': {0: 14438, 1: 14427, 2: 14424, 3: 14350, 4: 14344, 5: 14318, 6: 14299, 7: 14293, 8: 14275, 9: 14274},
             'Prom.': {0: 9098, 1: 1983, 2: 2340, 3: 5335, 4: 1846, 5: 4249, 6: 4584, 7: 3860, 8: 2782, 9: 2358}}

# Add your code here to read the dictionary into a dataframe called 'ftns'
ftns = pd.DataFrame(ftns_dict)

2. Use the **head()** method to see what the dataframe looks like.

In [None]:
# Type your code here
ftns.head()

3. Use the **shape** property to see how many rows and columns there are.

In [None]:
# Type your code here
ftns.shape

4. Rename the 'Elevation' column to 'Elevation (ft)'.
5. Rename the 'Prom.' column to 'Prominence (ft)'.

In [None]:
# Type your code here
ftns = ftns.rename(columns={'Elevation': 'Elevation (ft)', 'Prom.':'Prominence (ft)'})
ftns

6. Delete the 'Range' column and store the resulting dataframe back in the variable ftns.

In [None]:
ftns = ftns.drop(columns=['Range'])
ftns

7. Create two new columns called 'Elevation (m)' and 'Prominence (m)' that give the elevation and prominence of each mountain in metres. Remember 1 ft = 0.3048 m.

In [None]:
# Type your code here
ftns['Elevation (m)']=ftns['Elevation (ft)']*0.3048
ftns['Prominence (m)']=ftns['Prominence (ft)']*0.3048

In [None]:
ftns

8. Finally, use the describe() method to see a statistical summary of the data.

In [None]:
# Type your code here
ftns.describe()
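
You may notice that 'Rank' is missing from the summary. In the dictionary the ranks are stored as text ('1', '2', ...) rather than numbers, and describe() skips non-numeric columns by default. The sketch below, which uses only the first three peaks to stay short, shows one possible way to convert the column with astype(int). The name `small` is illustrative.

```python
import pandas as pd

# First three peaks only, with 'Rank' stored as text as in the exercise data
small = pd.DataFrame({
    "Rank": ["1", "2", "3"],
    "Summit Name": ["Mount Elbert", "Mount Massive", "Mount Harvard"],
    "Elevation": [14438, 14427, 14424],
})

# 'Rank' is text, so it is left out of the numeric summary
print("Rank" in small.describe().columns)

# Convert the text values to integers
small["Rank"] = small["Rank"].astype(int)

# Now 'Rank' is numeric and is included in the summary
print("Rank" in small.describe().columns)
```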