# Pandas

Pandas is a Python library built around a data structure called the `DataFrame`, together with functions and methods that help manipulate and analyze data. Dataframes are very useful for storing and manipulating data in tabular form, with rows and columns. In this notebook, we will demonstrate various features for manipulating data using pandas dataframes.

Let's start by importing the `pandas` library. We will also import `numpy` so we can use the `np.where()` function later on.

The code below creates a dictionary containing some information and uses pandas to convert it to a dataframe.

Run the code to see the pandas dataframe.

The data for this example are taken from: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_top_goalscorers.

In [None]:
import pandas as pd
import numpy as np

# Create a dictionary with information about some people - in this case the top 5 goal scorers at the FIFA World Cup
my_dict = {"Name": ["Miroslav Klose", "Ronaldo", "Gerd Muller", "Just Fontaine", "Lionel Messi"],
           "Country": ["Germany", "Brazil", "West Germany", "France", "Argentina"],
           "Goals": [16, 15, 14, 13, 13],
           "Games": [24, 19, 13, 6, 26]}

# Convert the dictionary into a pandas dataframe object
df = pd.DataFrame(my_dict)

# Show first few rows of the data
df.head()

## Dataframe Features

Now that we have the data stored in a dataframe, we can analyze it and perform operations on it.

The `shape` property returns the dimensions of the dataframe as a tuple: the first value is the number of rows and the second value is the number of columns.

The `describe()` method returns some **summary statistics** (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns of the data.

In [None]:
# Get the shape of the data in the format (rows, cols)
df.shape
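
Because `shape` is an ordinary tuple, it can be unpacked into separate variables. The sketch below uses a small hypothetical dataframe called `sample` (not the notebook's `df`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical three-row sample, used only for illustration
sample = pd.DataFrame({"Goals": [16, 15, 14], "Games": [24, 19, 13]})

# shape is a (rows, cols) tuple, so it can be unpacked directly
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```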

In [None]:
# The describe method gives some summary stats
df.describe()
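
By default, `describe()` summarizes only the numeric columns when a dataframe mixes numbers and text. As a minimal sketch, assuming a hypothetical two-row dataframe called `sample`, passing `include="all"` asks pandas to summarize the text columns as well.

```python
import pandas as pd

# Hypothetical two-row sample, separate from the notebook's df
sample = pd.DataFrame({"Name": ["Miroslav Klose", "Ronaldo"],
                       "Goals": [16, 15],
                       "Games": [24, 19]})

# Default behaviour: numeric columns only
numeric_summary = sample.describe()

# include="all" adds the non-numeric columns to the summary
full_summary = sample.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```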

## Referencing Parts of the Dataframe

You can reference a row, a column or a particular value.

To get a whole column, just use the column name as an index: `df['Goals']` for example.

In [None]:
# Get the column 'Goals'
df['Goals']
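
A single column comes back as a pandas `Series`, whereas indexing with a list of column names returns a smaller dataframe. The sketch below illustrates the difference on a hypothetical dataframe called `sample`.

```python
import pandas as pd

# Hypothetical sample, used only for illustration
sample = pd.DataFrame({"Name": ["Miroslav Klose", "Ronaldo"],
                       "Goals": [16, 15]})

goals = sample["Goals"]              # single label -> Series
subset = sample[["Name", "Goals"]]   # list of labels -> DataFrame

print(type(goals).__name__, type(subset).__name__)
print(goals.sum())  # Series support aggregations such as sum()
```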

You can reference a particular table value with the `iloc` indexer, giving the integer position of the row and column in square brackets. Think of `iloc` as being short for index or integer location.

The format is:
```
df.iloc[row_value, column_value]
```
Remember that indexing starts at 0 for both rows and columns, and that when we reference rows and columns, rows come first, then columns (RC).

In [None]:
# Print the entry in first row (row 0) and second column (column 1) of the data frame
print(df.iloc[0, 1])

In [None]:
# Print the entry in the second row and third column of the data frame
print(df.iloc[1, 2])

You can also use `iloc` to get a particular row.

In [None]:
# Print row 1 (second row) of the dataframe
print(df.iloc[1])

You can use `iloc` to get a 'slice' of the dataframe.

In [None]:
# Print rows 2-3 and columns 1-2 from the dataframe
print(df.iloc[2:4, 1:3])
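
An `iloc` slice follows the usual Python convention of excluding the end position. The label-based counterpart, `loc`, includes both ends of a slice. The sketch below compares the two on a hypothetical dataframe called `sample` whose row labels happen to match the row positions.

```python
import pandas as pd

# Hypothetical sample with the default 0, 1, 2 row labels
sample = pd.DataFrame({"Name": ["Miroslav Klose", "Ronaldo", "Gerd Muller"],
                       "Country": ["Germany", "Brazil", "West Germany"],
                       "Goals": [16, 15, 14]})

# Position-based: rows 0-1 and columns 1-2 (end positions excluded)
by_position = sample.iloc[0:2, 1:3]

# Label-based: row labels 0 through 1 (inclusive) and two named columns
by_label = sample.loc[0:1, ["Country", "Goals"]]

# Both selections pick out the same block of the table
print(by_position.equals(by_label))
```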

## Modifying Data
We can add or remove rows and columns and modify the data in particular locations. We can also make changes based on conditions.

We will do the following with this example dataframe:

1. Change the title of the 'Games' column to 'Games Played'.
2. Change Ronaldo's name to "Ronaldo Luís Nazário de Lima".
3. Add a column called 'Goals per Game' to represent the scoring rate: goals / games played.
4. Add a column called '20+ Games' which will indicate whether the player played 20 or more games.
5. Drop the 'Games Played' column.

In [None]:
# Rename 'Games' column to 'Games Played'
# The 'inplace=True' argument changes the existing dataframe, rather than creating a copy
df.rename(columns={"Games":"Games Played"}, inplace=True)
df

In [None]:
# Change Ronaldo's name to his full name (to avoid confusion with the other Ronaldo!)
# Note: You can also use df.loc[1, "Name"] here to reference the column by its label
df.iloc[1, 0] = "Ronaldo Luís Nazário de Lima"
df

In [None]:
# Create a new column which gives the scoring rate based on goals / games played.
df["Goals per Game"] = df["Goals"] / df["Games Played"]
df

In [None]:
# Use np.where() to create a column that indicates whether player played 20 games or more
df['20+ Games'] = np.where(df['Games Played'] >= 20, 'Yes', 'No')
df
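
To see how `np.where()` arrives at its result, it can help to look at the intermediate boolean mask. The sketch below uses a standalone `Series` called `games` (a hypothetical name) holding the same five values as the games column.

```python
import numpy as np
import pandas as pd

# Same five values as the games column, held in a standalone Series
games = pd.Series([24, 19, 13, 6, 26])

# The comparison produces one True/False value per row
mask = games >= 20

# np.where picks 'Yes' where the mask is True and 'No' elsewhere
labels = np.where(mask, "Yes", "No")

print(mask.tolist())
print(labels.tolist())
```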

In [None]:
# Drop the column 'Games Played'
# The 'inplace=True' argument changes the existing dataframe, rather than creating a copy
df.drop(columns=['Games Played'], inplace=True)
df
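
As an alternative to `inplace=True`, `rename()` and `drop()` return a new dataframe that can be assigned to a variable, leaving the original untouched. The sketch below shows this pattern on a hypothetical dataframe called `sample`; the names `renamed` and `trimmed` are likewise illustrative.

```python
import pandas as pd

# Hypothetical sample, used only for illustration
sample = pd.DataFrame({"Name": ["Just Fontaine", "Lionel Messi"],
                       "Goals": [13, 13],
                       "Games": [6, 26]})

# Without inplace=True, rename() returns a new dataframe
renamed = sample.rename(columns={"Games": "Games Played"})
renamed["Goals per Game"] = renamed["Goals"] / renamed["Games Played"]

# drop() also returns a new dataframe
trimmed = renamed.drop(columns=["Games Played"])

print(list(sample.columns))   # the original still has its 'Games' column
print(list(trimmed.columns))
```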

## Exercise 1

In this exercise, you will create a dataframe from a dictionary that lists the ten highest fourteeners (peaks above 14,000 ft) in Colorado and then carry out some standard operations.

Before doing task 1, run the code cell below to load the dictionary into a dataframe called `ft`.

In [None]:
# Here is the dictionary
ftns_dict = {'Rank': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7', 7: '8', 8: '9', 9: '10'},
             'Summit Name': {0: 'Mount Elbert', 1: 'Mount Massive', 2: 'Mount Harvard', 3: 'Blanca Peak', 4: 'La Plata Peak', 5: 'Uncompahgre Peak', 6: 'Crestone Peak',
                             7: 'Mount Lincoln', 8: 'Grays Peak', 9: 'Castle Peak'},
             'Range': {0: 'Sawatch', 1: 'Sawatch', 2: 'Sawatch', 3: 'Sangre de Cristo', 4: 'Sawatch', 5: 'San Juan', 6: 'Sangre de Cristo', 7: 'Tenmile-Mosquito', 8: 'Front Range', 9: 'Elk'},
             'Elevation': {0: 14438, 1: 14427, 2: 14424, 3: 14350, 4: 14344, 5: 14318, 6: 14299, 7: 14293, 8: 14275, 9: 14274},
             'Prom.': {0: 9098, 1: 1983, 2: 2340, 3: 5335, 4: 1846, 5: 4249, 6: 4584, 7: 3860, 8: 2782, 9: 2358}}

# Read the dictionary into a dataframe called 'ft'
ft = pd.DataFrame(ftns_dict)

1. Use the `head()` method to see what the dataframe looks like.

In [None]:
# Type your code here


2. Use the `shape` property to see how many rows and columns there are.

In [None]:
# Type your code here


3. Rename the `'Elevation'` column `'Elevation (ft)'` and check you've done this. Remember to use `inplace=True`.

In [None]:
# Type your code here


4. Delete the `'Prom.'` column and check you've done this. Make sure you use `inplace=True` to store the result back in the `ft` dataframe.

In [None]:
# Type your code here


5. Create a new column called `'Elevation (m)'` that gives the elevation of each mountain in metres. Use 1 ft = 0.3048 m as your conversion factor. Check that you have made this change.

In [None]:
# Type your code here


6. Finally, use the `describe()` method to see a statistical summary of the data. What is the average elevation of the ten highest mountains in Colorado?

In [None]:
# Type your code here


## Good Job!

You've made it to the end of the first `pandas` notebook. 

If you have time, you should start looking at the **Pandas Extension**, which is in a separate notebook.