<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org or zhuo.chen@ithaka.org<br />
___

# Pandas 1

**Description:** This notebook describes how to:
* Create a Pandas Series or a DataFrame
* Display a dataframe
* Add new data to a dataframe
* Sort a dataframe
* Select data from a dataframe

This is the first notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 90 minutes

**Data Format:** CSV (.csv)

**Libraries Used:** Pandas

**Research Pipeline:** None
___

## Intro to Pandas

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png" width="500"></center>

Pandas is a Python library that allows you to easily work with tabular data. Most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.

**Advantages of Spreadsheet Software**
* Point and click
* Easier to learn
* Great for small datasets (<10,000 rows)
* Better for browsing data

**Advantages of Pandas**
* More powerful data manipulation with Python
* Can work with large datasets (millions of rows)
* Faster for complicated manipulations
* Better for cleaning and/or pre-processing data
* Can automate workflows in a larger data pipeline

In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation.

In [1]:
# Install pandas
!pip install pandas



In [2]:
# Import pandas; the alias `pd` lets us type `pd` instead of `pandas` when we call it
import pandas as pd

## Pandas Series and Pandas DataFrame
Pandas can interpret a wide variety of data sources, including Excel files, CSV files, and Python objects like lists and dictionaries. Pandas converts these into two fundamental objects: 

* Pandas Series - a single column of data
* Pandas DataFrame - a table of data containing multiple columns and rows

### Pandas Series

We can think of a Series as a single column of data. Let's create a Series based on the champions of the last ten FIFA World Cup tournaments. 

|World Cup Winner|
|---|
|To be determined|
|France|
|Germany|
|Spain|
|Italy|
|Brazil|
|France|
|Brazil|
|Germany|
|Argentina|

We will put these country names into a Pandas Series. To create our Series, we pass a **list** to `pd.Series()`:

In [3]:
# Create a data series object in Pandas
champions = pd.Series([
    "To be determined", 
    "France", 
    "Germany", 
    "Spain", 
    "Italy", 
    "Brazil", 
    "France", 
    "Brazil", 
    "Germany", 
    "Argentina"
])

In [4]:
# Print out the Series
print(champions)

0    To be determined
1              France
2             Germany
3               Spain
4               Italy
5              Brazil
6              France
7              Brazil
8             Germany
9           Argentina
dtype: object


We can assign a name to our series using `.name`.

In [5]:
# Give our series a name
champions.name = 'World Cup Champion'
champions

0    To be determined
1              France
2             Germany
3               Spain
4               Italy
5              Brazil
6              France
7              Brazil
8             Germany
9           Argentina
Name: World Cup Champion, dtype: object
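
The name can also be supplied when the Series is created, through the `name` argument of `pd.Series()`. A minimal sketch, using a shortened, illustrative list (`recent_champions` is a made-up name, not part of the notebook's data):

```python
import pandas as pd

# Shortened, illustrative list of winners (not the full series above)
recent_champions = pd.Series(["France", "Germany", "Spain"], name="World Cup Champion")

print(recent_champions.name)
```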

### Pandas DataFrame

While a Pandas Series is a single column of data, a Pandas DataFrame contains multiple columns and rows. 

|Year|Champion|Host|
|---|---|---|
|2022|To be determined|Qatar|
|2018|France|Russia|
|2014|Germany|Brazil|
|2010|Spain|South Africa|
|2006|Italy|Germany|
|2002|Brazil|Japan and South Korea|
|1998|France|France|
|1994|Brazil|USA|
|1990|Germany|Italy|
|1986|Argentina|Mexico|

We will create a Pandas DataFrame based on this table. To create our dataframe, we pass a **dictionary** to `pd.DataFrame()`:

In [6]:
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990, 
                              1986],
                          
                         "Champion": ["?", 
                                      "France", 
                                      "Germany", 
                                      "Spain", 
                                      "Italy", 
                                      "Brazil", 
                                      "France", 
                                      "Brazil", 
                                      "Germany", 
                                      "Argentina"],
                          
                         "Host": ["Qatar", 
                                  "Russia", 
                                  "Brazil", 
                                  "South Africa", 
                                  "Germany", 
                                  "Japan and South Korea", 
                                  "France", 
                                  "USA", 
                                  "Italy", 
                                  "Mexico"]
                         })

wcup

Unnamed: 0,Year,Champion,Host
0,2022,?,Qatar
1,2018,France,Russia
2,2014,Germany,Brazil
3,2010,Spain,South Africa
4,2006,Italy,Germany
5,2002,Brazil,Japan and South Korea
6,1998,France,France
7,1994,Brazil,USA
8,1990,Germany,Italy
9,1986,Argentina,Mexico


In a Pandas dataframe, each column is a Pandas series. 

In [7]:
# Get the type of a column in a dataframe
type(wcup['Champion'])

pandas.core.series.Series
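
A small sketch of what this means in practice, using a made-up two-row dataframe (`demo`, `one_column`, and `sub_table` are illustrative names, not part of the notebook's data). Single brackets return a Series whose `.name` is the column header, while a list of headers inside the brackets returns a DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({"Year": [2018, 2014], "Champion": ["France", "Germany"]})

one_column = demo["Champion"]      # single brackets -> Series
sub_table = demo[["Champion"]]     # list of headers -> DataFrame

print(type(one_column).__name__, one_column.name)
print(type(sub_table).__name__, list(sub_table.columns))
```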

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
You are a middle school teacher. You teach the Butterfly Class and the Hippo Class. Last week, the Butterfly class had an English test and a math test. You would like to make a dataframe to record the English grades and Math grades of the students in the Butterfly Class. 

Make a dataframe with three columns: name, English and Math. 

## Data Manipulation

When we get a new dataset, we usually want a general idea of what it looks like.

### Get the size of the data
To get how many rows and columns a dataframe has, we use the `.shape` attribute. `df.shape` returns a tuple with (number of rows, number of columns).

In [8]:
# Get how many rows a dataframe has
wcup.shape[0]

10

In [9]:
# Get how many columns a dataframe has
wcup.shape[1]

3
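
Because `.shape` is an ordinary Python tuple, both numbers can also be unpacked in one step. A minimal sketch on a made-up dataframe (`demo`, `n_rows`, and `n_cols` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({
    "Year": [2018, 2014, 2010],
    "Champion": ["France", "Germany", "Spain"],
})

# .shape is (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")
```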

### Display the data
We can display part of the dataframe to get a peek at the data. The `.head()` and `.tail()` methods help us do that.

In [10]:
# Display the first five rows of the data
wcup.head()

Unnamed: 0,Year,Champion,Host
0,2022,?,Qatar
1,2018,France,Russia
2,2014,Germany,Brazil
3,2010,Spain,South Africa
4,2006,Italy,Germany


In [11]:
# Display the last five rows of the data
wcup.tail()

Unnamed: 0,Year,Champion,Host
5,2002,Brazil,Japan and South Korea
6,1998,France,France
7,1994,Brazil,USA
8,1990,Germany,Italy
9,1986,Argentina,Mexico


In [12]:
# Specify the number of rows to display
wcup.head(8)

Unnamed: 0,Year,Champion,Host
0,2022,?,Qatar
1,2018,France,Russia
2,2014,Germany,Brazil
3,2010,Spain,South Africa
4,2006,Italy,Germany
5,2002,Brazil,Japan and South Korea
6,1998,France,France
7,1994,Brazil,USA


### Add new data
#### Add a new row
We can add new rows to an existing dataframe.

Suppose we want to add two more World Cup tournaments to our dataframe. We can do this with the `pd.concat()` function. 

In [13]:
# We first make a dataframe using the new data
new_data = pd.DataFrame({'Year': [1982, 1978],
                         'Champion': ['Italy', 'Argentina'],
                         'Host': ['Spain', 'Argentina']})
new_data

Unnamed: 0,Year,Champion,Host
0,1982,Italy,Spain
1,1978,Argentina,Argentina


In [14]:
# Use ignore_index parameter to control the index
wcup = pd.concat([wcup, new_data], ignore_index = True)
wcup

Unnamed: 0,Year,Champion,Host
0,2022,?,Qatar
1,2018,France,Russia
2,2014,Germany,Brazil
3,2010,Spain,South Africa
4,2006,Italy,Germany
5,2002,Brazil,Japan and South Korea
6,1998,France,France
7,1994,Brazil,USA
8,1990,Germany,Italy
9,1986,Argentina,Mexico
10,1982,Italy,Spain
11,1978,Argentina,Argentina


By default, the `ignore_index` parameter is set to `False`, which means that Pandas keeps the original index values from the input dataframes. This can produce duplicate index values in the concatenated dataframe. Remove `ignore_index = True` and run the above code cell again. Can you see the difference? 
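
The effect can be sketched on two tiny, made-up dataframes (`first` and `second` are illustrative names). With the default setting both input indexes are kept, so the labels `0` and `1` each appear twice; with `ignore_index=True` the result is renumbered from 0:

```python
import pandas as pd

first = pd.DataFrame({"Year": [1990, 1986]})
second = pd.DataFrame({"Year": [1982, 1978]})

kept = pd.concat([first, second])                           # default: ignore_index=False
renumbered = pd.concat([first, second], ignore_index=True)  # fresh index 0..n-1

print(kept.index.tolist())
print(renumbered.index.tolist())
```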

#### Add a new column
We can also add new columns to an existing dataframe. 

Recall that when we make a dataframe, we pass a **dictionary** to `pd.DataFrame()`. Each `key:value` pair is a column of the table: the key is the header and the value is the list of data under that header. To make a new column, we first put the new data in a list. Then we add the new column in the same way that we add a new `key:value` pair to a dictionary. 

In [15]:
# Add a new column of scores
score = ["2-1",  # put the data in a list
         "4-2", 
         "1-0", 
         "1-0", 
         "6-4", 
         "2-0", 
         "3-0", 
         "3-2", 
         "1-0", 
         "3-2", 
         "3-1",
         "3-1"]


wcup['Score'] = score # make a new column of score
wcup

Unnamed: 0,Year,Champion,Host,Score
0,2022,?,Qatar,2-1
1,2018,France,Russia,4-2
2,2014,Germany,Brazil,1-0
3,2010,Spain,South Africa,1-0
4,2006,Italy,Germany,6-4
5,2002,Brazil,Japan and South Korea,2-0
6,1998,France,France,3-0
7,1994,Brazil,USA,3-2
8,1990,Germany,Italy,1-0
9,1986,Argentina,Mexico,3-2
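
One thing to watch for: the list needs exactly one item per row of the dataframe, which is why the `score` list above has twelve items. A minimal sketch on a made-up dataframe (`demo` and `error_message` are illustrative names) of what happens otherwise; Pandas raises a `ValueError`:

```python
import pandas as pd

demo = pd.DataFrame({"Year": [2018, 2014, 2010]})

error_message = None
try:
    demo["Score"] = ["4-2", "1-0"]   # only two values for three rows
except ValueError as err:
    error_message = str(err)
    print("Could not add the column:", error_message)

demo["Score"] = ["4-2", "1-0", "1-0"]  # one value per row works
print(list(demo.columns))
```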


#### Make a new column based on an old one
We often want to extract some information from an existing column and put it in a new column. 

Suppose we want to make a new column storing the number of goals the winner scored in the final. The 'Score' column already contains this information. How do we extract the number of goals from the 'Score' column and put it in a new column? We can use the `.apply()` method. 

In [16]:
# Use the .apply method to operate on an old column and store the results in a new column
wcup['Goals Scored'] = wcup['Score'].apply(lambda r: r.split('-')[0])
wcup

Unnamed: 0,Year,Champion,Host,Score,Goals Scored
0,2022,?,Qatar,2-1,2
1,2018,France,Russia,4-2,4
2,2014,Germany,Brazil,1-0,1
3,2010,Spain,South Africa,1-0,1
4,2006,Italy,Germany,6-4,6
5,2002,Brazil,Japan and South Korea,2-0,2
6,1998,France,France,3-0,3
7,1994,Brazil,USA,3-2,3
8,1990,Germany,Italy,1-0,1
9,1986,Argentina,Mexico,3-2,3


The `.apply()` method applies the function within the parentheses to each value in the target column. In our example, the target column is the `Score` column. 

The function within the parentheses is a lambda function, which looks different from the familiar way of writing a function in Python with `def`. What comes before the colon is the input to the function. What comes after the colon is the output of the function. In the current example, the function takes an input string `r`, splits `r` at the hyphen `-` into a list of strings, and grabs the first element of the list. 
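
For comparison, the same operation can be written as an ordinary named function and passed to `.apply()`. A minimal sketch on a made-up Series (`scores` and the helper `goals_scored` are hypothetical names used only for illustration):

```python
import pandas as pd

scores = pd.Series(["4-2", "1-0", "3-2"])

def goals_scored(r):
    """Return the part of a score string before the hyphen."""
    return r.split("-")[0]

with_def = scores.apply(goals_scored)                   # named function
with_lambda = scores.apply(lambda r: r.split("-")[0])   # same logic, inline

print(with_def.equals(with_lambda))
```

Both versions produce the same Series; the inline form simply saves defining a function that is only used once.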



In [17]:
# A quick refresh of .split()
"2-1".split('-')

['2', '1']

In [18]:
# Grab the first element of a list
"2-1".split('-')[0]

'2'

Let's also create a new column storing the goals conceded by the champion at the final.

In [19]:
# Use the .apply method to operate on an old column and store the results in a new column
wcup['Goals Conceded'] = wcup['Score'].apply(lambda r: r.split('-')[1])
wcup

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded
0,2022,?,Qatar,2-1,2,1
1,2018,France,Russia,4-2,4,2
2,2014,Germany,Brazil,1-0,1,0
3,2010,Spain,South Africa,1-0,1,0
4,2006,Italy,Germany,6-4,6,4
5,2002,Brazil,Japan and South Korea,2-0,2,0
6,1998,France,France,3-0,3,0
7,1994,Brazil,USA,3-2,3,2
8,1990,Germany,Italy,1-0,1,0
9,1986,Argentina,Mexico,3-2,3,2


You are ready to make still another column storing the difference between the scored goals and conceded goals. 

In [20]:
# Make a new column based on two old ones
wcup['Difference'] = wcup['Goals Scored'] - wcup['Goals Conceded']

TypeError: unsupported operand type(s) for -: 'str' and 'str'

We get an error message. Why? Recall that when we created the column 'Score', we stored the data as strings. As a result, all the numbers in the columns 'Goals Scored' and 'Goals Conceded' are also strings, and the subtraction operation is not defined for strings. In order to calculate the difference, we need to convert the data to a numeric type first. In Pandas, the `.astype()` method does that. 

In [21]:
# Change the data type to integer and calculate the difference
wcup['Difference'] = wcup['Goals Scored'].astype(int) - wcup['Goals Conceded'].astype(int)
wcup

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
0,2022,?,Qatar,2-1,2,1,1
1,2018,France,Russia,4-2,4,2,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
4,2006,Italy,Germany,6-4,6,4,2
5,2002,Brazil,Japan and South Korea,2-0,2,0,2
6,1998,France,France,3-0,3,0,3
7,1994,Brazil,USA,3-2,3,2,1
8,1990,Germany,Italy,1-0,1,0,1
9,1986,Argentina,Mexico,3-2,3,2,1
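
An alternative, sketched here on a made-up Series rather than on `wcup`, is to convert to integers inside the function passed to `.apply()`, so that the new values are numeric from the start (`scores`, `scored`, and `conceded` are illustrative names):

```python
import pandas as pd

scores = pd.Series(["4-2", "1-0", "3-2"])

# int(...) turns each extracted piece of text into a number
scored = scores.apply(lambda r: int(r.split("-")[0]))
conceded = scores.apply(lambda r: int(r.split("-")[1]))

difference = scored - conceded   # subtraction now works without .astype()
print(difference.tolist())
```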


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
You find that the whole Butterfly class did not do well in the English test. You decide to give each student a grade boost of 5%. Create a new column to store the new English scores after the boost. You will need to use the `.apply` method. 

### Sort the data
#### Sort by one column
We can sort the entire dataframe by one column. For example, we can use the `.sort_values()` method to sort the dataframe by the `Year` column. The data in this column are numeric. By default, the dataframe is sorted by `Year` in ascending order.

In [22]:
# Sort the dataframe by the column 'Year'
wcup.sort_values(by=['Year'])

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
11,1978,Argentina,Argentina,3-1,3,1,2
10,1982,Italy,Spain,3-1,3,1,2
9,1986,Argentina,Mexico,3-2,3,2,1
8,1990,Germany,Italy,1-0,1,0,1
7,1994,Brazil,USA,3-2,3,2,1
6,1998,France,France,3-0,3,0,3
5,2002,Brazil,Japan and South Korea,2-0,2,0,2
4,2006,Italy,Germany,6-4,6,4,2
3,2010,Spain,South Africa,1-0,1,0,1
2,2014,Germany,Brazil,1-0,1,0,1


In [23]:
# Specify the sorting order
wcup.sort_values(by=['Year'], ascending=False)

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
0,2022,?,Qatar,2-1,2,1,1
1,2018,France,Russia,4-2,4,2,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
4,2006,Italy,Germany,6-4,6,4,2
5,2002,Brazil,Japan and South Korea,2-0,2,0,2
6,1998,France,France,3-0,3,0,3
7,1994,Brazil,USA,3-2,3,2,1
8,1990,Germany,Italy,1-0,1,0,1
9,1986,Argentina,Mexico,3-2,3,2,1
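
Note that `.sort_values()` returns a new, sorted dataframe and leaves the original untouched. A minimal sketch on a made-up dataframe (`demo` and `by_year` are illustrative names); to keep the sorted version, assign the result to a name:

```python
import pandas as pd

demo = pd.DataFrame({"Year": [2018, 1998, 2010]})

by_year = demo.sort_values(by=["Year"])   # sorted copy

print(demo["Year"].tolist())      # original order is unchanged
print(by_year["Year"].tolist())   # sorted order
```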


#### Sort by multiple columns
It is a convention to sort soccer results first by difference (i.e. how many more goals the champion scored than the runner-up), from largest to smallest, and then by goals conceded (i.e. how many goals the champion let in), from smallest to largest. Pandas can do that when we pass a list of columns and a matching list of sort orders. 

In [24]:
wcup.sort_values(by=['Difference', 'Goals Conceded'], ascending=[False, True])

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
6,1998,France,France,3-0,3,0,3
5,2002,Brazil,Japan and South Korea,2-0,2,0,2
10,1982,Italy,Spain,3-1,3,1,2
11,1978,Argentina,Argentina,3-1,3,1,2
1,2018,France,Russia,4-2,4,2,2
4,2006,Italy,Germany,6-4,6,4,2
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
8,1990,Germany,Italy,1-0,1,0,1
0,2022,?,Qatar,2-1,2,1,1
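
One caveat: the `Goals Conceded` column still holds strings, and strings are sorted character by character rather than by numeric value. For single-digit values like the ones here the two orders agree, but in general they do not. A sketch with made-up values (`as_text` is an illustrative name):

```python
import pandas as pd

as_text = pd.Series(["10", "2", "1"])

print(as_text.sort_values().tolist())               # text order
print(as_text.astype(int).sort_values().tolist())   # numeric order
```

Converting a column with `.astype(int)` before sorting avoids this surprise.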


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
Add the student grades from the Hippo Class to your dataframe. You will need to use the `pd.concat()` function.

Sort the new dataframe by the English grade to see who did the best of the two classes in English. Sort the new dataframe by the Math grade to see who did the best of the two classes in Math.

### Select data
There are multiple ways to select data from a dataframe. We will focus on two in this notebook: `.iloc` and `.loc`. 
#### .iloc
To the left of each row in a dataframe are index numbers. The index numbers are similar to the index numbers for a Python list; they help us reference a particular row for data retrieval. Also, like a Python list, the index begins with 0. We can retrieve a row by its position using the `.iloc` attribute, which stands for "integer location."

The `.iloc` indexer is used for integer-location based indexing/selection. 

The syntax of `iloc` is `df.iloc[row selection, column selection]`.

In [25]:
# Select a single row
wcup.iloc[5]

Year                               2002
Champion                         Brazil
Host              Japan and South Korea
Score                               2-0
Goals Scored                          2
Goals Conceded                        0
Difference                            2
Name: 5, dtype: object

When we select multiple rows from a dataframe, we give a starting index and an ending index. Notice that the selected rows will not include the final index row. 

In [26]:
# Select multiple consecutive rows
wcup.iloc[2:5] # select the rows at positions 2, 3, and 4

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
2,2014,Germany,Brazil,1-0,1,0,1
3,2010,Spain,South Africa,1-0,1,0,1
4,2006,Italy,Germany,6-4,6,4,2
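
This is the same rule that Python lists follow. A small sketch with a made-up list and dataframe (`years` and `demo` are illustrative names) showing that `2:5` picks positions 2, 3, and 4 in both cases:

```python
import pandas as pd

years = [2022, 2018, 2014, 2010, 2006, 2002]
demo = pd.DataFrame({"Year": years})

from_list = years[2:5]                        # list slicing
from_iloc = demo.iloc[2:5]["Year"].tolist()   # .iloc slicing

print(from_list)
print(from_iloc)
```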


In [27]:
# Select multiple non-consecutive rows
wcup.iloc[[0,2,5]]

Unnamed: 0,Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
0,2022,?,Qatar,2-1,2,1,1
2,2014,Germany,Brazil,1-0,1,0,1
5,2002,Brazil,Japan and South Korea,2-0,2,0,2


In [28]:
# Select a single column
wcup.iloc[:,1]

0             ?
1        France
2       Germany
3         Spain
4         Italy
5        Brazil
6        France
7        Brazil
8       Germany
9     Argentina
10        Italy
11    Argentina
Name: Champion, dtype: object

In [29]:
# Select multiple consecutive columns 
wcup.iloc[:,2:5]

Unnamed: 0,Host,Score,Goals Scored
0,Qatar,2-1,2
1,Russia,4-2,4
2,Brazil,1-0,1
3,South Africa,1-0,1
4,Germany,6-4,6
5,Japan and South Korea,2-0,2
6,France,3-0,3
7,USA,3-2,3
8,Italy,1-0,1
9,Mexico,3-2,3


In [30]:
# Select multiple non-consecutive columns
wcup.iloc[:,[1,3,5]]

Unnamed: 0,Champion,Score,Goals Conceded
0,?,2-1,1
1,France,4-2,2
2,Germany,1-0,0
3,Spain,1-0,0
4,Italy,6-4,4
5,Brazil,2-0,0
6,France,3-0,0
7,Brazil,3-2,2
8,Germany,1-0,0
9,Argentina,3-2,2


Now that you know how to select rows and columns from a dataframe using `.iloc[ ]`, you should be able to figure out how to get a slice of a dataframe using `.iloc[ ]`. For example, suppose you would like to know the host, champion, and score of the World Cup tournaments between 1994 and 2010. How do you slice the dataframe `wcup` to get the part you are interested in?

In [31]:
# Slice the dataframe using .iloc[ ]


#### .loc
Another way to select data from a dataframe is to use the `.loc` attribute. 

The syntax of `.loc` is the same as `.iloc`: `df.loc[row selection, column selection]`.

We can use the index number to select rows from a dataframe. 

In [32]:
# Use row index to select a row
wcup.loc[1]

Year                2018
Champion          France
Host              Russia
Score                4-2
Goals Scored           4
Goals Conceded         2
Difference             2
Name: 1, dtype: object

While we can use the row index to select a row, `.loc` is actually mainly used to select data based on label. For example, we could set the index of the dataframe to the `Year` column. 

In [33]:
# Set the dataframe index to Year
wcup.set_index('Year', inplace=True)

In [34]:
# Use the row label to select a row
wcup.loc[2006]

Champion            Italy
Host              Germany
Score                 6-4
Goals Scored            6
Goals Conceded          4
Difference              2
Name: 2006, dtype: object

In [35]:
# Select multiple rows
wcup.loc[[2002, 2006, 2018]]

Year,Champion,Host,Score,Goals Scored,Goals Conceded,Difference
2002,Brazil,Japan and South Korea,2-0,2,0,2
2006,Italy,Germany,6-4,6,4,2
2018,France,Russia,4-2,4,2,2
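
`.loc` also accepts a slice of labels. Unlike an `.iloc` slice, a `.loc` slice includes the ending label. A minimal sketch on a made-up dataframe indexed by year (`demo`, `by_label`, and `by_position` are illustrative names, and the data is a shortened stand-in for `wcup`):

```python
import pandas as pd

demo = pd.DataFrame(
    {"Champion": ["France", "Germany", "Spain", "Italy"]},
    index=[2018, 2014, 2010, 2006],
)

by_label = demo.loc[2014:2006]     # labels 2014 through 2006, end included
by_position = demo.iloc[1:3]       # positions 1 and 2, end excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```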


The headers are the labels of the columns. We can select columns using their label. 

In [36]:
# Select a column 
wcup.loc[:,'Host']

Year
2022                    Qatar
2018                   Russia
2014                   Brazil
2010             South Africa
2006                  Germany
2002    Japan and South Korea
1998                   France
1994                      USA
1990                    Italy
1986                   Mexico
1982                    Spain
1978                Argentina
Name: Host, dtype: object

In [37]:
# Select multiple columns
wcup.loc[:,['Champion', 'Host', 'Score']]

Year,Champion,Host,Score
2022,?,Qatar,2-1
2018,France,Russia,4-2
2014,Germany,Brazil,1-0
2010,Spain,South Africa,1-0
2006,Italy,Germany,6-4
2002,Brazil,Japan and South Korea,2-0
1998,France,France,3-0
1994,Brazil,USA,3-2
1990,Germany,Italy,1-0
1986,Argentina,Mexico,3-2


Now that you know how to select rows and columns from a dataframe using `.loc[ ]`, you should be able to figure out how to get a slice of a dataframe using `.loc[ ]`. For example, suppose you would like to know the champion, host, and score of the World Cup tournaments between 1994 and 2010. How do you slice the dataframe `wcup` to get the part you are interested in?

In [38]:
# Slice the dataframe using .loc[ ]


<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
Three of your students failed their math test last time. You would like to select their data from the dataframe to see how they did this time. 

You can choose either `.iloc[ ]` or `.loc[ ]` to do this exercise. 

___
## Lesson Complete

Congratulations! You have completed *Pandas 1*.

### Start Next Lesson: [Pandas 2 ->](./pandas-2.ipynb)

### Exercise Solutions
Here are a few solutions for exercises in this lesson.