<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />
___

# Pandas 1

**Description:** This notebook describes how to:
* Create a Pandas Series or a DataFrame
* Explore the data in a dataframe
* Access data from a dataframe
* Add new data to a dataframe
* Sort a dataframe

This is the first notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 90 minutes

**Data Format:** None

**Libraries Used:** Pandas

**Research Pipeline:** None
___

## Intro to Pandas

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png" width="500"></center>

Pandas is a Python library that allows you to easily work with tabular data. Most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.

**Advantages of Spreadsheet Software**
* Point and click
* Easier to learn
* Great for small datasets (<10,000 rows)
* Better for browsing data

**Advantages of Pandas**
* More powerful data manipulation with Python
* Can work with large datasets (millions of rows)
* Faster for complicated manipulations
* Better for cleaning and/or pre-processing data
* Can automate workflows in a larger data pipeline

In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation.

To use Pandas, we will first install the library by running the following command.

In [None]:
# Install pandas
!pip3 install pandas

After the installation is completed, we import the library.

In [None]:
# import pandas, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd

## Pandas Series and Pandas DataFrame
In Pandas, data are stored in two fundamental objects: 

* Pandas Series - a single column of data
* Pandas DataFrame - a table of data containing multiple columns and rows

### Pandas Series

We can think of a Series as a single column of data. Here we have a column called `Champions` with the winners of the ten most recent FIFA World Cup tournaments.

|Champions|
|---|
|Argentina|
|France|
|Germany|
|Spain|
|Italy|
|Brazil|
|France|
|Brazil|
|Germany|
|Argentina|

Let's create a Series based on this column. 

To create our Series, we pass a **list** to the `pd.Series()` constructor:

In [None]:
# Create a data series object in Pandas
champions = pd.Series(["Argentina",
                       "France", 
                       "Germany", 
                       "Spain", 
                       "Italy", 
                       "Brazil", 
                       "France", 
                       "Brazil", 
                       "Germany", 
                       "Argentina"]
                     )

In [None]:
# Print out the Series
print(champions)
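
As a minimal sketch of what a Series holds, the example below builds a shorter Series and looks at its values and labels. The variable names (`winners`, `first_winner`, `n_values`, `labels`) are made up for illustration and are not used elsewhere in this lesson.

```python
import pandas as pd

# A short Series built from a list (illustrative data only)
winners = pd.Series(["Argentina", "France", "Germany"])

# Each value gets an integer label, starting at 0
first_winner = winners[0]      # look up the value with the label 0
n_values = len(winners)        # number of values in the Series
labels = list(winners.index)   # the default labels as a plain list

print(first_winner, n_values, labels)
```

The numbers printed to the left of a Series are these labels, not part of the data.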

### Pandas DataFrame

While a Pandas Series is a single column of data, a Pandas DataFrame can have multiple columns and rows. 

|Year|Champion|Host|
|---|---|---|
|2022|Argentina|Qatar|
|2018|France|Russia|
|2014|Germany|Brazil|
|2010|Spain|South Africa|
|2006|Italy|Germany|
|2002|Brazil|Korea/Japan|
|1998|France|France|
|1994|Brazil|USA|
|1990|Germany|Italy|
|1986|Argentina|Mexico|

We will create a Pandas DataFrame based on this table. To create our dataframe, we pass a **dictionary** to the `pd.DataFrame()` constructor:

In [None]:
# Create a Pandas dataframe
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990, 
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"]
                    })

wcup

In a Pandas dataframe, each column is a Pandas series. 

In [None]:
# Get the type of a column in a dataframe
type(wcup['Champion'])

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
You are a middle school teacher. You teach the Butterfly Class and the Hippo Class. Last week, the Butterfly class had an English test and a math test. You would like to make a dataframe to record the English grades and math grades of the students in the Butterfly Class. 

Make a dataframe with three columns: name, English and Math. 

## Explore the data

After we build a dataframe, we usually want a general idea of what it looks like. The first step is to explore the dataframe's attributes. Attributes are properties of the dataframe (not functions), so they do not have parentheses `()` after them. 

|Attribute|Reveals|
|---|---|
|.shape| The number of rows and columns|
|.columns| The name of each column|


To get how many rows and columns a dataframe has, we use the `.shape` attribute. `df.shape` returns a tuple with (number of rows, number of columns).

In [None]:
# df.shape returns a tuple (# of rows, # of columns)
wcup.shape

In [None]:
# Get how many rows a dataframe has
wcup.shape[0]

In [None]:
# Get how many columns a dataframe has
wcup.shape[1]

In [None]:
# Use `.columns` to find the column names
wcup.columns
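
Because `.shape` is a tuple, it can also be unpacked into two variables in one step. The sketch below assumes a small three-row dataframe named `df`, created only for this illustration.

```python
import pandas as pd

# Small illustrative table
df = pd.DataFrame({"Year": [2022, 2018, 2014],
                   "Champion": ["Argentina", "France", "Germany"]})

# .shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = df.shape

# .columns is an Index object; list() turns it into a plain Python list
column_names = list(df.columns)

print(n_rows, n_cols, column_names)
```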

There are some methods we can use to explore the data as well. 


|Method|Reveals|
|---|---|
|.info( )| Column count and data type|
|.head( )| First five rows|
|.tail( )|Last five rows|

In [None]:
# Use `.info()` to get column count and data type
wcup.info()

We can get a preview of the dataframe. The `.head()` and `.tail()` methods help us do that.

In [None]:
# Display the first five rows of the data
wcup.head()

In [None]:
# Display the last five rows of the data
wcup.tail()

In [None]:
# Specify the number of rows to display
wcup.head(8)

## Access data
In this section, we will take a look at the different ways of accessing the data in a dataframe. 

For example, once you know the column names, you can access a column of interest. You can use either the bracket notation `df['ColumnName']` or the dot notation `df.ColumnName`.

In [None]:
# Use bracket notation to access the column 'Champion'
wcup['Champion']

You can see that what gets returned is a Pandas Series. If you would like the returned object to be a dataframe instead, put the column name inside its own pair of square brackets. Note that in this case, you end up with two layers of square brackets.

In [None]:
# Add one more layer of square brackets to show the result in the format of a dataframe
wcup[['Champion']]

In [None]:
# Use the dot notation to access the column 'Champion'
wcup.Champion

We can also access multiple columns from a dataframe. Note that in this case, you also have two layers of square brackets.

In [None]:
# Access multiple columns
wcup[['Year','Champion']]
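
The two notations are not fully interchangeable. A minimal sketch, assuming a small dataframe `df` with a made-up column called `Goals Scored`: dot notation only works when the column name is a valid Python name, so a name containing a space has to go in brackets.

```python
import pandas as pd

# Illustrative table; "Goals Scored" contains a space on purpose
df = pd.DataFrame({"Champion": ["Argentina", "France"],
                   "Goals Scored": [3, 4]})

by_bracket = df["Champion"]      # bracket notation returns a Series
by_dot = df.Champion             # dot notation returns the same Series
same = by_bracket.equals(by_dot)

# A column name with a space can only be reached with brackets
goals = df["Goals Scored"]

# Double square brackets return a dataframe instead of a Series
as_frame = df[["Champion"]]

print(same, type(by_bracket).__name__, type(as_frame).__name__)
```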

### Access rows and columns
In Pandas, there are two indexers `.iloc` and `.loc` that are often used to access data in a dataframe. 
#### .iloc
`.iloc` allows us to access a row or a column using its integer location.

To the left of each row in a dataframe are index numbers. The index numbers are similar to the index numbers for a Python list; they help us reference a particular row for data retrieval. Also, like a Python list, the index begins with 0. We can retrieve a row using the `.iloc` indexer, which stands for "integer location."

The syntax of `.iloc` indexer is `df.iloc[row selection, column selection]`.

In [None]:
# Access a single row
wcup.iloc[5] # Access the row with the index number 5

Again, the returned object is a Pandas Series. If you want a dataframe instead, put the index number inside its own pair of square brackets. You will end up with two layers of square brackets.

In [None]:
# Access a single row, return a dataframe
wcup.iloc[[5]] # Access the row with the index number 5

When we select multiple consecutive rows from a dataframe, we give a starting index and an ending index. Notice that the row at the ending index is not included in the selection. 

In [None]:
# Access multiple consecutive rows
wcup.iloc[2:5] # Access the rows with the index number 2, 3, and 4

In [None]:
# Access multiple non-consecutive rows
wcup.iloc[[0,2,5]] # Access the rows with the index number 0, 2, and 5

So far, we have seen how to use `.iloc` to access rows. We can also use the `.iloc` indexer to access columns.

In [None]:
# Access a single column
wcup.iloc[:,1] # Access the second column of the dataframe wcup

With `.iloc`, we cannot use the column name to access a column because `.iloc` accesses data using their integer location. 

In [None]:
# .iloc cannot access a column by its name, so running this cell raises an error
wcup.iloc[:,'Champion']

In [None]:
# Access multiple consecutive columns 
wcup.iloc[:,1:3] # Access the second and third column of the dataframe wcup 

In [None]:
# Access multiple non-consecutive columns
wcup.iloc[:,[0,2]] # Access the first and third column of the dataframe wcup 
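
Row and column selections can be combined in a single `.iloc` call. The sketch below uses a small made-up dataframe `df` (not the `wcup` data) to show the `df.iloc[row selection, column selection]` pattern.

```python
import pandas as pd

# Illustrative table with four rows and three columns
df = pd.DataFrame({"Letter": ["a", "b", "c", "d"],
                   "Number": [10, 20, 30, 40],
                   "Flag": [True, False, True, False]})

# Rows at positions 1 and 2, columns at positions 0 and 1
block = df.iloc[1:3, 0:2]

# A single cell: row position 3, column position 1
cell = df.iloc[3, 1]

print(block)
print(cell)
```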

Now that you know how to select rows and columns from a dataframe using `.iloc`, you should be able to figure out how to get a slice of a dataframe using `.iloc`. For example, suppose you would like to know the champions of the world cup games between 1994 and 2010. How do you slice the dataframe `wcup` to get the part you are interested in?

In [None]:
# Slice the dataframe using .iloc[ ]


#### .loc
While `.iloc` is integer-based, `.loc` is label-based. It means that you have to access rows and columns based on their row and column labels.

The syntax of `.loc` is `df.loc[row selection, column selection]`.

At the moment, the labels for the rows are just their index numbers. When we use `.loc` to access a row, it will look very similar to what we did with `.iloc`.

In [None]:
# Access a row using .loc
wcup.loc[0]

But we could make our index column customized. For example, we could use the column `Year` as the index column.

In [None]:
# Set the column 'Year' as the index column
wcup = wcup.set_index('Year')
wcup

After we make the change, we will use the new labels to access the rows. 

In [None]:
# Access a row using .loc
wcup.loc[2006]

In [None]:
# Access multiple consecutive rows
wcup.loc[2018:2010] 

Note that with the label search, the ending index row is included.

In [None]:
# Access multiple non-consecutive rows
wcup.loc[[1994, 2002, 2010]]

In [None]:
# Access a column
wcup.loc[:, 'Host']

Now that you know how to select rows and columns from a dataframe using `.loc[ ]`, you should be able to figure out how to get a slice of a dataframe using `.loc[ ]`. For example, suppose you would like to know the champions of the world cup games between 1994 and 2010. How do you slice the dataframe `wcup` to get the part you are interested in?

In [None]:
# Slice the dataframe using .loc[ ]


**As a quick reminder**, `.iloc[]` slicing does not include the final value. On the other hand, `.loc[]` slicing does include the final value. 
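
The difference is easiest to see side by side. This is a minimal sketch on a made-up dataframe `df` whose row labels (10, 20, 30, 40) are set with the `index` argument of `pd.DataFrame()`, which this lesson has not used so far.

```python
import pandas as pd

# Illustrative table with custom row labels
df = pd.DataFrame({"Value": ["a", "b", "c", "d"]},
                  index=[10, 20, 30, 40])

# Position-based slice: positions 0 and 1, position 2 is excluded
by_position = df.iloc[0:2]

# Label-based slice: labels 10 through 30, label 30 is included
by_label = df.loc[10:30]

print(list(by_position.index))
print(list(by_label.index))
```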

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
You have three students who failed their math test last time. You would like to select their data from the dataframe you created to see how they did this time. 

You can choose either `.iloc[ ]` or `.loc[ ]` to do this exercise. 

#### Set, reset and use indexes
We have seen that by default, the rows in a dataframe are numbered by integer indexes starting from 0. The indexes look like a column to the far left without a name. 

In [None]:
# Get the original dataframe
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990, 
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"]
                    })

wcup

In the previous section on `.loc`, we learned how to set the index column to one of the columns in the dataframe. This is desirable because a range of integers is not descriptive, while a named column is. When we want to locate specific data, descriptive labels are much more useful. 

In [None]:
# Set index column to 'Host'
wcup.set_index('Host')

Note that the `.set_index` method does not make the change in place by default. This is actually good because we can preview the change first before we decide to commit the change. 

In [None]:
# No change to the original dataframe by default
wcup

To make the change permanent, we will need to add the parameter `inplace=True` to `.set_index()`.

In [None]:
# Change the index column and commit the change
wcup.set_index('Host', inplace=True)
wcup
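
Another way to keep the change, already used earlier in this lesson with `wcup = wcup.set_index('Year')`, is to assign the result back to the same name. A minimal sketch on a made-up two-row dataframe `df`:

```python
import pandas as pd

# Illustrative table
df = pd.DataFrame({"Host": ["Qatar", "Russia"],
                   "Champion": ["Argentina", "France"]})

# set_index returns a new dataframe; df itself is unchanged
indexed = df.set_index("Host")
original_columns = list(df.columns)

# Reassigning the name keeps the change
df = df.set_index("Host")

print(original_columns)
print(list(df.index), list(df.columns))
```

Both approaches end with the same dataframe; which one to use is largely a matter of style.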

You can also sort the index column. Here, we have a text column as our index column. When we sort the index, by default, the dataframe is sorted by the index column in ascending alphabetical order. 

In [None]:
# Sort the indexes
wcup.sort_index()

You could set the parameter `ascending=False` to sort the indexes in a descending order.

In [None]:
# Specify the ascending order
wcup.sort_index(ascending=False)

Note that the sorting change is not committed by default. If you want to make the change permanent, again, you will have to add `inplace=True`.

In [None]:
# By default, the dataframe is not updated
wcup

Sometimes we want to change the index column back to the default integers. In this case, we can use the method `reset_index()`.

In [None]:
# Reset the index column
wcup.reset_index()

But again, to make the reset permanent, you will have to add `inplace=True`.

In [None]:
# Reset the index and update the dataframe
wcup.reset_index(inplace=True)
wcup
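
As a small sketch of the round trip, the example below sets an index on a made-up dataframe and then resets it. The old index comes back as an ordinary column and the rows are numbered from 0 again.

```python
import pandas as pd

# Illustrative table with 'Host' used as the index
df = pd.DataFrame({"Host": ["Qatar", "Russia"],
                   "Champion": ["Argentina", "France"]}).set_index("Host")

# reset_index moves 'Host' back into the columns
restored = df.reset_index()

print(list(restored.columns))
print(list(restored.index))
```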

## Manipulate the data
### Add new data
#### Add a new row
We can add new rows to an existing dataframe.

Suppose we want to add two more world cup games to our dataframe. We will need to use the `pd.concat()` function to do it. 

In [None]:
# Take a look at our current dataframe
wcup

In [None]:
# We first make a dataframe using the new data
new_data = pd.DataFrame({'Year': [1982, 1978],
                         'Champion': ['Italy', 'Argentina'],
                         'Host': ['Spain', 'Argentina']})

In [None]:
# Concatenate two dataframes 
pd.concat([wcup, new_data])

In [None]:
# Set ignore_index=True and update the dataframe after the concatenation
wcup = pd.concat([wcup, new_data], ignore_index=True)
wcup

By default, the `ignore_index` parameter is set to `False`. This means that Pandas keeps the original index values from the input dataframes, which can cause duplicate index values in the concatenated dataframe. If you want to discard the original index values and renumber the rows, set the `ignore_index` parameter to `True`.
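
The effect on the index can be checked directly. A minimal sketch with two made-up dataframes, `first` and `second`:

```python
import pandas as pd

# Two small illustrative tables, each with the default index 0, 1
first = pd.DataFrame({"Year": [2022, 2018]})
second = pd.DataFrame({"Year": [1982, 1978]})

# Default behavior: the original index values are kept, so they repeat
kept = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```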

#### Add a new column
We can also add new columns to an existing dataframe. 

Recall that when we make a dataframe, we pass a **dictionary** to the `pd.DataFrame()` constructor. Each `key:value` pair is a column of the table: the key is the header and the value is the list of data under that header. To make a new column, we first put the new data we want to add in a list. Then, we add the new column in the same way that we add a new `key:value` pair to a dictionary. 

In [None]:
# Add a new column of scores
score = ["7-5", # put the data in a list
         "4-2", 
         "1-0", 
         "1-0", 
         "6-4", 
         "2-0", 
         "3-0", 
         "3-2", 
         "1-0", 
         "3-2", 
         "3-1",
         "3-1"]


wcup['Score'] = score # make a new column of score
wcup
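
One thing to watch for: the list must have exactly one value per row. The sketch below, on a made-up three-row dataframe `df`, shows that a list of the wrong length is rejected with a `ValueError`.

```python
import pandas as pd

# Illustrative table with three rows
df = pd.DataFrame({"Year": [2022, 2018, 2014]})

# A list with three values matches the three rows
df["Champion"] = ["Argentina", "France", "Germany"]

# A list with only two values does not match and is rejected
try:
    df["Host"] = ["Qatar", "Russia"]
    error_raised = False
except ValueError:
    error_raised = True

print(error_raised)
```

This is why the `score` list above has twelve entries: the dataframe has twelve rows after the concatenation.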

#### Make a new column based on an old one
We often want to extract some information from an existing column and put it in a new column. 

Suppose we want to make a new column storing the number of goals the winner scored at the final. The `Score` column actually already has this information available to us. How do we extract the number of goals from the `Score` column and put it in a new column? We will need to use the `.apply()` method. 

In [None]:
# Use the .apply method to operate on an old column and store the results in a new column
wcup['Goals Scored'] = wcup['Score'].apply(lambda r: r.split('-')[0])
wcup

The `.apply()` method applies the function within the parentheses to the value in each row of the target column. In our example, the target column is the `Score` column. 

The function within the parentheses is a lambda function, which looks different from the familiar way of writing a function in Python. Here, what's before the colon stands for the input to the function. What's after the colon is the output of the function. In the current example, the function takes an input string `r`, splits `r` by the hyphen `-` into a list of strings and grabs the first element from the list. 
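
To connect this with the familiar `def` syntax, the sketch below writes the same operation twice: once as a lambda and once as a named function. The function name `goals_scored` and the Series `scores` are made up for this illustration.

```python
import pandas as pd

def goals_scored(score):
    """Return the part of a 'scored-conceded' string before the hyphen."""
    return score.split("-")[0]

# Illustrative scores
scores = pd.Series(["4-2", "1-0", "3-1"])

# The lambda and the named function do the same thing
with_lambda = scores.apply(lambda r: r.split("-")[0])
with_function = scores.apply(goals_scored)

print(list(with_lambda))
print(with_lambda.equals(with_function))
```

Note that `.apply()` receives the function itself, so the named version is passed without parentheses.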



In [None]:
# A quick refresh of .split()
"2-1".split('-')

In [None]:
# Grab the first element of a list
"2-1".split('-')[0]

Let's also create a new column storing the goals conceded by the champion at the final.

In [None]:
# Use the .apply method to operate on an old column and store the results in a new column
wcup['Goals Conceded'] = wcup['Score'].apply(lambda r: r.split('-')[1])
wcup

Now we are ready to make another column storing the difference between the goals scored and the goals conceded. 

In [None]:
# Make a new column based on two old ones
wcup['Difference'] = wcup['Goals Scored'] - wcup['Goals Conceded']

We get an error message. Why? Recall that when we created the column 'Score', we stored the data as strings. As a result, all the numbers in the column 'Goals Scored' and the column 'Goals Conceded' are also strings. The subtraction operation, however, cannot be applied to strings. In order to calculate the difference, we need to convert the data to a numeric type first. In Pandas, the method `.astype()` does that. 

In [None]:
# Change the data type to integer and calculate the difference
wcup['Difference'] = wcup['Goals Scored'].astype(int) - wcup['Goals Conceded'].astype(int)
wcup
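
A related pitfall is that some operations on number-like strings do not fail but give a surprising result. A minimal sketch, assuming a made-up Series `goals` of digit strings: adding the strings joins the text, while adding the converted integers does arithmetic.

```python
import pandas as pd

# Illustrative goal counts stored as strings
goals = pd.Series(["4", "1", "3"])

# Adding strings joins them as text instead of adding numbers
concatenated = goals + goals

# After conversion, the same operation is real arithmetic
as_numbers = goals.astype(int)
doubled = as_numbers + as_numbers

print(list(concatenated))
print(list(doubled))
```

Checking the column types with `.info()` before doing arithmetic helps catch this early.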

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
You find that the whole Butterfly class did not do well in the English test. You decide to give each student a grade boost of 10%. Create a new column to store the new English scores after the boost. You will need to use the `.apply` method. 

### Sort the data
#### Sort by one column
We can sort the entire dataframe by one column. For example, we can use the `.sort_values()` method to sort the dataframe by the column of `Year`. The data in this column are of numeric type. By default, the dataframe will be sorted by `Year` in an ascending order.

In [None]:
# Sort the dataframe by the column 'Year'
wcup.sort_values(by=['Year'])

In [None]:
# Specify the sorting order
wcup.sort_values(by=['Year'], ascending=False)

#### Sort by multiple columns
It is a convention to sort soccer results first by goal difference (i.e. how many more goals the champion scored than the runner-up) and then by goals conceded (i.e. how many goals the champion let in). Pandas can easily do that. 

In [None]:
wcup.sort_values(by=['Difference', 'Goals Conceded'], ascending=[False, True])
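
To see how the two sorting keys interact, here is a minimal sketch on a made-up dataframe `df` with numeric columns. Rows are ordered by `Difference` from high to low, and ties are broken by `Conceded` from low to high.

```python
import pandas as pd

# Illustrative results; teams A and C tie on Difference, as do B and D
df = pd.DataFrame({"Team": ["A", "B", "C", "D"],
                   "Difference": [2, 1, 2, 1],
                   "Conceded": [1, 0, 0, 2]})

# One True/False per column in `by`, in the same order
ranked = df.sort_values(by=["Difference", "Conceded"],
                        ascending=[False, True])

print(list(ranked["Team"]))
```

In the lesson's dataframe, `Goals Conceded` still holds strings, which are sorted character by character. That gives the expected order for single digits, but converting the column with `.astype(int)` first would be safer for larger numbers.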

<h3 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h3>
 
Add the student grades from the Hippo Class to your dataframe. You will need to use the `pd.concat()` function.

Sort the new dataframe by the English grade to see who did the best of the two classes in English. Sort the new dataframe by the Math grade to see who did the worst of the two classes in Math.

___
## Lesson Complete

Congratulations! You have completed *Pandas 1*.

### Start Next Lesson: [Pandas 2 ->](./pandas-2.ipynb)

### Exercise Solutions
Here are a few solutions for exercises in this lesson.

In [None]:
# Make a dataframe to record English and Math score of the Butterfly class
butterfly = pd.DataFrame({"Name": ['John Smith', 
                              'Alex Hazel', 
                              'Beatrice Dean', 
                              'Jane White', 
                              'Eve Lynn'],
                          
                         "English": [78,
                                    80,
                                    72,
                                    75,
                                    73],
                          
                         "Math": [80,
                                 75,
                                 95,
                                 70,
                                 82]
                         })
butterfly

In [None]:
# Get the math grades of the three students who failed math last time
butterfly = butterfly.set_index('Name')
butterfly.loc[['John Smith', 'Alex Hazel', 'Jane White'], 'Math']

In [None]:
# Create a new column storing the boosted English grades
butterfly['EnglishBoosted'] = butterfly['English'].apply(lambda r: r*1.10)
butterfly

In [None]:
# Add the student grades from the Hippo class to the Butterfly dataframe
hippo = pd.DataFrame({"Name": ['Joe Smith', 
                              'Alice Charlie', 
                              'Ben Cole', 
                              'Jill Cheung', 
                              'Dave Gale'],
                          
                         "English": [82,
                                    85,
                                    90,
                                    88,
                                    92],
                          
                         "Math": [85,
                                 78,
                                 92,
                                 80,
                                 82]
                         })

hippo = hippo.set_index('Name')
all_students = pd.concat([butterfly, hippo])
all_students

In [None]:
# Sort the new dataframe to see who did best in English of the two classes
all_students.sort_values(by='English', ascending=False)

In [None]:
# Sort the new dataframe to see who did worst in Math of the two classes
all_students.sort_values(by='Math')
