# Introduction to Pandas for Working with Tabular Data

<div class="alert alert-success">
    
## This notebook covers
- Pandas data structures - DataFrames and Series
- Selecting, slicing, and querying DataFrames
- Simple calculations with summary functions
- Sorting and grouping data
- Copying and renaming DataFrame columns
- Handling missing values
- Merging DataFrames and writing to file
</div>

<div class="alert alert-warning">

## Reminders

Remember, you can use Jupyter's built-in table of contents (hamburger on the far left) to jump from heading to heading.

---

This notebook should run in the Anaconda base environment. We'll discuss more about environments later, but for now look for something like the words "Python3" or "base" at the top right of this notebook. If it says "No Kernel", go to the Kernel tab, select Change Kernel, then select the Python3 or base kernel in the pop up window.

---

To run cells in this notebook place your cursor in the cell you want to run, then hit Shift+Enter.

---

To turn on line numbers for code cells, go to the View menu and click Show Line Numbers.

</div>

# I. Importing Necessary Packages
The following code will load the packages we'll need for this notebook. Packages are collections of code that add functionality to core Python. It's best practice to import everything we need in one place at the top of a notebook or script.

The "as pd" part of the Pandas import statement below is giving the Pandas package an alias in our notebook. This way when we want to use functions from the Pandas package we can type, for example, ```pd.DataFrame()``` instead of the longer ```pandas.DataFrame()```. An alias can be anything really, but "pd" is the alias that the Pandas user community has settled on.

This particular package was installed to your computer during the installation of Anaconda, so we can simply import it here instead of having to take the extra step of downloading it first. We'll cover how to download and import additional packages in a subsequent notebook.

In [None]:
import pandas as pd

Packages that extend the core Python language, such as the one we imported above, usually have a website where we can find tons of helpful information. Package websites may include, for example, "Getting Started" tutorials, in-depth user guides, an API reference that documents the particulars of every single available function, and instructions on where to ask the user community questions, submit bug reports, or make software contributions.

**If you need help, package websites are one of the first places you should look. Let's take a quick look at the Pandas website [https://pandas.pydata.org/](https://pandas.pydata.org/)**.

# II. Introduction to Pandas Data Structures

On the Pandas website, the package developers describe the project's goal: Pandas "aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language."

Pandas is a powerful tool for working with tabular data such as data stored in spreadsheets, databases, or other table-like formats. The main data structure Pandas uses to hold data is called the *DataFrame*. A DataFrame is a 2-dimensional (rows and columns) structure that can store data of different types including strings, integers, floating point values, categorical data, and more. DataFrames are like a spreadsheet - think of a table of data with column headings, row numbers, and data values. Each column of a Pandas DataFrame is its own data structure called a *Series*. Both data structures (DataFrame and Series) have an *index*. We can think of an index, for now, as a row or line number, but it can be anything, like a date or any other text.

Below are schematics of what a Pandas DataFrame and Series look like, where the darker grey boxes would hold headers (column names) and indexes (row names/numbers), while the lighter grey boxes would hold the data values.

<table><tr>
<!-- <td> <img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg" alt="schematic of a dataframe" width="700"/> </td>
<td> <img src="https://pandas.pydata.org/docs/_images/01_table_series.svg" alt="schematic of a series" width="200"/> </td> -->
<td> <img src="images/01_table_dataframe.svg" alt="schematic of a dataframe" width="700"/> </td>
<td> <img src="images/01_table_series.svg" alt="schematic of a series" width="200"/> </td>
</tr></table>
         
(Image Source: [Pandas Docs Getting Started Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html))

## Making a DataFrame from Scratch

We can make a DataFrame programmatically, as opposed to reading data from a file. For example, let's create one from lists.

In [None]:
listOfColumnNames = ["rank", "county_fips",
                     "per_capita_income", "median_household_income",
                     "population", "num_households"]

listOfTuples = [(1,'089',32223,59730,98468,35297),
                (2,'121',27183,56159,145165,52539),
                (3,'073',27399,50075,57786,21237),
                (4,'033',25065,59734,166234,56641),
                (5,'059',23547,49620,140298,50185),
                (6,'047',23111,44550,194029,69384),
                (7,'149',22079,40404,48773,18941),
                (8,'045',21935,44494,43929,17380),
                (9,'081',21831,39049,82910,32086),
                (10,'131',21691,43728,17786,6165)]

# create DataFrame from lists
countyData = pd.DataFrame(listOfTuples, columns = listOfColumnNames)
        
# if placed on a line by itself, we get pretty output of the DataFrame    
countyData

Our data is now in a Pandas DataFrame, which has 6 named columns of data (6 Series) as well as a row index (the bold column without a header).
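
The row index above is the default 0, 1, 2, and so on, but an index can hold other labels too. As a minimal sketch (the ```temps``` DataFrame, its column name, and its values are made up purely for illustration), here is a DataFrame that uses date strings as its index:

```python
import pandas as pd

# hypothetical data: three daily temperature readings indexed by date strings
temps = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.2]},
    index=["2024-06-01", "2024-06-02", "2024-06-03"],
)

# a single column of a DataFrame is a Series, and it keeps the same index
temp_series = temps["temp_c"]

print(type(temps).__name__)
print(type(temp_series).__name__)
print(list(temp_series.index))
```

The Series carries the date labels along with the data values, which is what lets Pandas line up rows by label rather than by position.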

Isolating a Series from a DataFrame would look like this:

In [None]:
countyData["population"]

We can see that the rendering of a Series isn't pretty like that of a DataFrame. But notice how a Series isn't just the data values. The index remains attached and the column name is there as well. How do we know this is a Pandas Series? We can use Python's built-in function ```type()``` like we did in the previous notebook.

In [None]:
type(countyData["population"])

And for good measure...

In [None]:
type(countyData)

## Loading Tabular Data from a File into a DataFrame

Pandas makes it very easy to load an Excel file or other tabular data sources like .csv files into DataFrames. 

The .csv file extension is a common file format for tabular data. CSV stands for comma-separated values. Inside a CSV file, you will see rows of data with commas between the values of each data column (i.e., *comma delimited*). Generally, we use .csv files instead of Excel files because an Excel worksheet is limited to 1,048,576 rows. (Also, reading an Excel file with Pandas requires installation of an additional package. We'll cover that in a subsequent lesson.)

A raw CSV file with a header row might look like this, for example:

<pre>
Book Title,Publisher,Price
War and Peace,Vintage Classics,12.99
"Our Bodies, Ourselves",Touchstone,48.38
Putin's Playbook,"Simon & Schuster, Inc.",14.49
</pre>

Notice that the second book title listed and the third publisher name have a comma in the data value. Those values are surrounded by quotation marks to avoid Pandas interpreting the comma as a new column. This is important to do when you are creating your own CSV data.
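
To see the quoting at work, here is a small sketch that reads the same example text from a string instead of a file. Using ```io.StringIO``` to stand in for a file is a choice made only for this illustration.

```python
import io
import pandas as pd

# the example CSV text from above, held in a string instead of a file
csv_text = """Book Title,Publisher,Price
War and Peace,Vintage Classics,12.99
"Our Bodies, Ourselves",Touchstone,48.38
Putin's Playbook,"Simon & Schuster, Inc.",14.49
"""

# io.StringIO lets read_csv treat the string like an open file
books = pd.read_csv(io.StringIO(csv_text))

print(books.shape)                   # rows, columns
print(books.loc[1, "Book Title"])    # the quoted title stays in one cell
print(books.loc[2, "Publisher"])     # so does the quoted publisher
```

Because the values with commas are quoted, Pandas still finds three columns in every row.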

Let's begin by loading an example data file in .csv format into a Pandas DataFrame.

The example data used in this notebook is college football bowl data. While this particular data may not be relevant to your job or research, the data exploration and cleaning techniques we'll work through do broadly apply to any tabular data you may have (in .csv, .xls(x) or even .txt formats).  


In [None]:
bowlData = pd.read_csv('data/collegefootballbowl.csv')
bowlData

Notice what gets printed to the screen when a DataFrame has many rows. We can see the first 5 and last 5 rows of data, followed by the shape (rows, columns) of the full DataFrame.

Also notice some columns contain NaN values. NaN stands for "not a number" and usually represents a missing data value of type float. We'll cover more about different missing data values, how to handle them, and the missing data features Pandas offers later.
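
In the meantime, here is a minimal sketch of how NaN behaves, using a made-up ```attendance``` Series (not the real bowl data):

```python
import pandas as pd

# hypothetical attendance values with one missing entry
attendance = pd.Series([50000, float("nan"), 72000])

# NaN is a float, so the whole Series is stored as floats
print(attendance.dtype)

# .isna() marks which values are missing
missing_mask = attendance.isna()
print(missing_mask.tolist())

# .count() only counts the non-missing values
print(attendance.count())
```

Even though two of the three values look like integers, the presence of a single NaN makes Pandas store the column as floating point values.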


<div class="alert alert-danger">
    
**Sidebar about data management for tabular data:** It's important to, at minimum, create a data dictionary that describes what is in your data file, even if your data file contains column names. Often there is more information (metadata) that is required by future users of the data than just the column names and data values. Metadata for the collegefootballbowl.csv can be found in the [data/collegefootballbowl.txt](data/collegefootballbowl.txt) file, which tells future data users the original source of the data, when the data file was created, and contains a data dictionary that describes each column of data. This type of information is super important especially when your data values have units. Even future you may forget if your temperature data, for example, is in Celsius or Fahrenheit, or if your precipitation data had units of cm, mm, inches, or hundredths of inches! 
</div>

## Viewing the Head and Tail of a DataFrame

```.head()``` lets us view the first N rows of a DataFrame 

```.tail()``` lets us view the last N rows

Using either one of these methods on a DataFrame without any additional parameters will show us 5 rows. Enter an integer as a parameter to either function to view a different number of rows.



In [None]:
# view first 10 rows
bowlData.head(10)

In [None]:
# view last 10 rows
bowlData.tail(10)

## Getting DataFrame Information

### Shape Property: How Many Rows / Columns?

```.shape``` returns a tuple (rows, columns) and is the most concise way to see how many total rows and columns are in a DataFrame. 

In [None]:
bowlData.shape

### Dtypes Property: Understanding the Data Types of Each Column

Often, we will need to know the data type of each column. ```.dtypes``` will give us that information.

Anything that says "object" is probably storing strings. int64 and float64 are numerical data (numbers).

In [None]:
bowlData.dtypes

How were the data types for each column determined? 

When we read the .csv file using ```pd.read_csv()``` Pandas inferred the data type of each column based on the column's data values. If all values in a column appear to be integers, Pandas will infer that the column is data type int. Sometimes, there can be mistakes in our data files though. We could have a data column, for example "winner_rank", that should contain all integers or missing values but a data entry mistake in the file has added 1 or more values that are non-numeric in that column. A lot of real data is "messy" like this. In these cases where there are mixed data types in a single column, Pandas usually reads the whole column as strings and assigns a data type of object. This is in fact the case with our data columns "winner_rank" and "loser_rank", which we will investigate a bit more later.
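
As a rough sketch of this inference, the example below reads two tiny made-up CSV strings. The column name matches our data, but the values (including the stray ```12th```) are invented for illustration.

```python
import io
import pandas as pd

# hypothetical file contents: one clean column, one with a stray non-numeric value
clean_csv = "winner_rank\n1\n5\n12\n"
messy_csv = "winner_rank\n1\n5\n12th\n"

clean = pd.read_csv(io.StringIO(clean_csv))
messy = pd.read_csv(io.StringIO(messy_csv))

# the clean column is inferred as an integer type
print(clean["winner_rank"].dtype)

# the messy column is no longer numeric; its values are kept as strings
print(messy["winner_rank"].dtype)
print(pd.api.types.is_numeric_dtype(messy["winner_rank"]))
print(repr(messy.loc[0, "winner_rank"]))
```

A single non-numeric entry is enough to change how the entire column is stored, including the values that look like numbers.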

### Describe Method: Summary of Numerical Data (count, mean, std, min, quartiles, max)

We can access simple statistical information for numerical data columns using the ```.describe()``` method.

In [None]:
bowlData.describe()

Notice that ```.describe()``` returns statistical information only for numerical data columns. 

For which year was the earliest data record in our DataFrame collected? We can see the answer is the "min" of the "year" column, 1901.

To see all columns, numeric or not, we can add a parameter to ```.describe()``` like this:

In [None]:
bowlData.describe(include = 'all')

We can now see 3 additional statistics that operate only on non-numeric data columns: "unique", "top", and "freq". NaN appears wherever the data type is not appropriate for the statistic. 

Who is the most common sponsor and how many times were they a sponsor? 

Looking to the "sponsor" column, the "top" row indicates the most common data value is Outback Steakhouse and the "freq" row indicates that Outback Steakhouse was a sponsor 26 times.

### Info Method: Understanding all Fields, Null Values, Dtypes, Shape, Size, etc.

In [None]:
bowlData.info()

# III. Selecting/Slicing Data with .loc[]

The Pandas ```.loc[]``` indexer allows us to directly access rows by index and columns by name with ```.loc[row_index, column_name]```, or to access all rows that satisfy a condition with ```.loc[condition]```. Let's take a look, starting with selecting by row index and column name.

## Single Cell

In [None]:
# SELECT a single cell - the attendance column where row index is 1
bowlData.loc[1, 'attendance']

If we put an integer in the row part of ```.loc[row_index,column_name]```, Pandas assumes this is the index of the row.
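
In ```bowlData``` the index happens to match the row position, so the distinction is easy to miss. The sketch below uses a made-up DataFrame whose index labels are 10, 20, 30 to show that ```.loc[]``` looks up the label, while ```.iloc[]``` looks up the integer position.

```python
import pandas as pd

# hypothetical frame whose index labels are 10, 20, 30 rather than 0, 1, 2
df = pd.DataFrame({"attendance": [100, 200, 300]}, index=[10, 20, 30])

# .loc uses the index label
by_label = df.loc[20, "attendance"]

# .iloc uses the integer position (row 1 is the second row)
by_position = df.iloc[1, 0]

# there is no row labelled 1, so .loc raises a KeyError
try:
    df.loc[1, "attendance"]
    label_found = True
except KeyError:
    label_found = False

print(by_label, by_position, label_found)
```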

## Single row

In [None]:
# SELECT whole row of data where row index is 1
bowlData.loc[1, :]

A colon by itself means "everything", here specifically it means all columns.

## Single Column

In [None]:
# SELECT a single column of data, just the attendance column
bowlData.loc[:, 'attendance']

Here the colon by itself means all rows.

## Slice of Rows

In [None]:
# SELECT a slice of rows where row index is 1 to 6
bowlData.loc[1:6, :]

A colon between two integers means a slice (meaning get multiple rows).

Notice that a ```.loc[]``` slice of a Pandas DataFrame includes the ending row index 6, so a slice of rows 1:6 returns 6 rows. This is not always the case with other Python packages. Array slicing with the Numpy or Xarray packages, for example, is exclusive of the ending index, but we'll cover that in a subsequent lesson.
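
A quick sketch of the difference, using a made-up ten-row DataFrame and a plain Python list:

```python
import pandas as pd

values = list(range(10))
df = pd.DataFrame({"value": values})

loc_rows = df.loc[1:6]      # label-based slice: includes the row labelled 6
iloc_rows = df.iloc[1:6]    # position-based slice: stops before position 6
list_items = values[1:6]    # plain Python list slice: also stops before 6

print(len(loc_rows), len(iloc_rows), len(list_items))  # 6 5 5
```

Only the label-based ```.loc[]``` slice includes the end point.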

## All Rows, Slice of Columns

In [None]:
# SELECT all rows but only a slice of columns from year to winner_rank 
bowlData.loc[:, 'year':'winner_rank']

Notice that when the output has more rows than Pandas is configured to display, it is shortened to show a bit at the beginning and a bit at the end. The same thing will happen if the output has too many columns.

## Slice of Rows, Slice of Columns

In [None]:
# SELECT a slice of rows (index 10 to 15) and a slice of columns ('year' to 'winner_rank')
bowlData.loc[10:15, 'year':'winner_rank']

## Slice of Rows, Particular Columns

In [None]:
# SELECT a slice of rows and two specific columns
bowlData.loc[10:15, ['year', 'winner_rank']]

## All Rows Where Column has Certain Value

Now, instead of putting row indexes and column names in ```.loc[]``` we'll use a condition instead.

In [None]:
# find all rows where the year column is equal to 1901
bowlData.loc[bowlData['year'] == 1901]

This returned only one row. If there were more rows where the "year" column was equal to 1901, then there would be more rows returned.

In [None]:
# find all rows where the sponsor column is missing data
bowlData.loc[bowlData['sponsor'].isna()]

## Rows Based on Column Comparison

If we want to return all the games where the winner was ranked below (higher number) the loser (lower number), we could do it this way.

In [None]:
# find all rows where the winner ranked below the loser
bowlData.loc[bowlData['winner_rank'] > bowlData['loser_rank']]

**Be careful in your analysis! Do you notice anything suspicious about the results that were returned?**

We're looking for cases where the underdog won. For example, a team that was ranked 20th beat the team that was ranked 10th (see index 1429). If we look closely at the results above, we can see multiple examples of rows returned that are NOT what we asked for. For example, index 1427 shows the winner had a better rank (8th) than the loser (13th). So, we've got incorrect results but did not receive an error message.

**Why did this happen??** 

This comes back to the data type of the columns "winner_rank" and "loser_rank". A ranking should be a numerical data type (such as int or float). If these columns had a numeric data type then we wouldn't have experienced any problems using the greater than operator. But if those columns contain non-numeric data (dtype object, which usually indicates string data) then that could yield unexpected results when comparing if one string is greater than another. Let's take a look at the data type for "winner_rank" and "loser_rank".

In [None]:
bowlData[['winner_rank', 'loser_rank']].dtypes

Uh oh! We can see that Pandas determined the data type of those columns to be object, which is non-numeric. We discussed briefly already why this might happen. This should indicate to us that there must be some messy data somewhere in those columns, some data values that don't look like numbers. And due to this, Pandas is treating those data columns like strings instead of numbers. So, when we ask if the "winner_rank" is greater than the "loser_rank", the two strings are compared character by character rather than by numeric value, which returns unexpected results.
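
The sketch below reproduces the problem on a tiny made-up DataFrame. The ```ranks``` name and its two rows are invented for illustration and mirror the 8th vs. 13th and 20th vs. 10th cases described above.

```python
import pandas as pd

# hypothetical ranks stored as strings, as happens when a column is read as text
ranks = pd.DataFrame({"winner_rank": ["8", "20"],
                      "loser_rank": ["13", "10"]})

# string comparison goes character by character: "8" > "13" because "8" > "1"
as_strings = ranks["winner_rank"] > ranks["loser_rank"]

# after converting to integers the comparison is numeric
as_numbers = ranks["winner_rank"].astype(int) > ranks["loser_rank"].astype(int)

print(as_strings.tolist())  # [True, True]
print(as_numbers.tolist())  # [False, True]
```

The first row is flagged as an upset only when the ranks are compared as strings, and no error is raised either way.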

When beginning work with any dataset it is best to do some data cleaning first to make sure your columns are of the expected data types and remove any whacky data values in order to avoid problems like this one. Otherwise, you might easily miss the fact that your code isn't doing what you intended.

Let's see what the problem is, clean up those two columns, ensure their dtype is numeric, and try our selection again.

We'll start by using a Pandas function that we haven't seen yet: ```.unique()``` on the "winner_rank" column of the DataFrame.

In [None]:
# look at all the unique data values in the winner_rank column
bowlData['winner_rank'].unique()

Look at that! Someone has entered a value of "Pennsylvania" in the column of "winner_rank" somewhere in the data file. That doesn't make sense at all. Notice, we also have some missing data (NaN), but that shouldn't cause us problems here.

Let's force the "winner_rank" data column to be numeric using another Pandas function that we haven't seen yet: ```pd.to_numeric()```

In [None]:
# reassign all the values in the column winner_rank with numeric data values
# any value that does not look like a number will be changed to the missing data value
bowlData['winner_rank'] = pd.to_numeric(bowlData['winner_rank'], errors = 'coerce')

# look again at the unique data values
bowlData['winner_rank'].unique()

In [None]:
# look at the new data type
bowlData['winner_rank'].dtype

How did we know what parameters to enter in the ```pd.to_numeric()``` function? That information can be found in the Pandas documentation on the [API reference page for ```pd.to_numeric()```](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html). Or we could have done a quick Google search for something like "how to use pandas to_numeric", which will bring up a great AI generated answer (at least it does in the USA at the time this notebook was created), the link to the Pandas API reference page, and many other websites that demonstrate how to use the function.
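
For instance, the API reference describes the ```errors``` parameter, whose default value is ```'raise'```. Here is a minimal sketch of the difference between the default and ```errors = 'coerce'```, using a made-up Series rather than our bowl data:

```python
import pandas as pd

# hypothetical column of text values, one of which is not a number
raw = pd.Series(["1", "12", "not a rank"])

# with the default errors='raise', a non-numeric value stops the conversion
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# with errors='coerce', the non-numeric value becomes NaN instead
coerced = pd.to_numeric(raw, errors="coerce")

print(raised)
print(coerced.tolist())
print(coerced.dtype)
```

Coercing is convenient for cleaning, but it silently discards the original text, so it is worth looking at the unique values first, as we did above.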

Let's check out what the problem is with "loser_rank".

In [None]:
# look at all the unique data values
bowlData['loser_rank'].unique()

Again, there's a nonsensical data value "TN". We'll do the same process to clean the "loser_rank" column and convert to a numeric data type.

In [None]:
# reassign all the values in the column with numeric data values
# any value that does not look like a number will be changed to the missing data value
bowlData['loser_rank'] = pd.to_numeric(bowlData['loser_rank'], errors = 'coerce')

# look again at the unique data values
bowlData['loser_rank'].unique()

In [None]:
# look at the new data type
bowlData['loser_rank'].dtype

Lastly, let's try the row selection by column comparison again.

In [None]:
# finds all rows where the winner ranked below the loser
bowlData.loc[bowlData['winner_rank'] > bowlData['loser_rank']]

Notice how the result now returns 215 rows whereas before we got 255 rows. Our original selection returned 40 incorrect results!

<div class="alert alert-info"> 

## Exercise 1: Selecting with .loc[]

Use ```.loc[]``` to select row indexes 100 through 110 and the three columns "year", "winner_points", "loser_points" from the ```bowlData``` DataFrame.

</div>

In [None]:
# add your code here


<div class="alert alert-info"> 

Now select the rows where attendance is greater than 100,000.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
How many games in the DataFrame had attendance greater than 100,000? Don't count the rows in your answer above, determine your answer programmatically.
</div>

In [None]:
# add your code here


# IV. Selecting data with .query()

Similar to ```.loc[]```, the ```.query()``` method can be used to select rows in a DataFrame based on a condition. Inside the ```.query()``` we can put a condition that looks a little bit like a database query. You may like this method if you have experience working with databases.

## Rows Where Column has Certain Numerical Value

In [None]:
# find all rows where year is equal to 1901
bowlData.query("year == 1901")

## Rows Where Column has String Value

In [None]:
# find all rows where the bowl_name is "Rose Bowl"
# notice the single quotes around the string "Rose Bowl"
bowlData.query("bowl_name == 'Rose Bowl'")

## Rows Based on Substring Comparison

In [None]:
# Grab all rows where the word "State" appears in the winner_tie column
bowlData.query("winner_tie.str.contains('State')")

## Rows Based on Column Comparison

In [None]:
# find all rows where the loser ranked higher (smaller number) than the winner (larger number)
bowlData.query("loser_rank < winner_rank")

## Rows Based on Comparison with a Variable

In [None]:
# First, let's calculate the mean winner_points x 2
# .mean() is calculating the mean of the entire winner_points column
twiceTheMean = bowlData.winner_points.mean() * 2

# then, we can use that value to query for winners with more than twice the mean
bowlData.query("winner_points > @twiceTheMean")

## Multiple Criteria Query

In [None]:
# Find all the rows where the winner is Alabama, 
# AND Alabama had more than 50 points, 
# AND they weren't ranked number 1.
bowlData.query("(winner_tie == 'Alabama') and (winner_points > 50) and (winner_rank > 1)")

<div class="alert alert-info"> 
    
## Exercise 2: Selecting with .query()

Use ```.query()``` to find rows where the "winner_tie" column contains "State", the "bowl_name" contains "Rose", and the attendance is greater than 75,000.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
Show programmatically how many rows your query found.
</div>

In [None]:
# add your code here

# V. Querying a DataFrame without .query()

Different syntax can be used to accomplish the same data queries we covered above without using ```.query()```. This syntax may be preferable if you don't have experience working with databases.

In [None]:
# find all rows where year is equal to 1901 without using .query

# copy of code from the .query() section for reference
# bowlData.query("year == 1901") 

# alternative query syntax
bowlData[bowlData.year == 1901]

In [None]:
# find all rows where the bowl_name is "Rose Bowl"

# copy of code from the .query section for reference
# bowlData.query("bowl_name == 'Rose Bowl'") 

# alternative query syntax
bowlData[bowlData.bowl_name == "Rose Bowl"]

In [None]:
# Grab all rows where the word "State" appears in the winner_tie column

# copy of code from the .query section for reference
# bowlData.query("winner_tie.str.contains('State')")

# alternative query syntax
bowlData[bowlData.winner_tie.str.contains('State')]

In [None]:
# find all rows where the loser ranked higher (smaller number) than the winner (larger number) without using .query

# copy of code from the .query section for reference
# bowlData.query("loser_rank < winner_rank")

# alternative query syntax
bowlData[bowlData.loser_rank < bowlData.winner_rank]

In [None]:
# Rows Based on Comparison with a Variable

# copy of code from the .query section for reference
# twiceTheMean = bowlData.winner_points.mean() * 2
# bowlData.query("winner_points > @twiceTheMean")

# alternative query syntax
twiceTheMean = bowlData.winner_points.mean() * 2  # the same
bowlData[bowlData.winner_points > twiceTheMean]

In [None]:
# Find all the rows where the winner is Alabama, 
# AND Alabama had more than 50 points, 
# AND they weren't ranked number 1.

# copy of code from the .query section for reference
# bowlData.query("(winner_tie == 'Alabama') and (winner_points > 50) and (winner_rank > 1)")

# alternative query syntax
bowlData[(bowlData.winner_tie == 'Alabama') & (bowlData.winner_points > 50) & (bowlData.winner_rank > 1)]

Notice that here we link multiple conditions with ```&```, whereas in the previous notebook, Python Language Basics, we used the boolean operator ```and```. The ```&``` operator is called "bitwise and", whereas Python's ```and``` is called "logical and". Outside of ```.query()```, Pandas requires the bitwise operators ```&``` (and), ```|``` (or), and ```~``` (not) to combine conditions, and each condition should be wrapped in parentheses. Using ```and```, ```or```, or ```not``` on whole columns will raise an error.
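
Here is a small sketch of the three bitwise operators and of the error raised by ```and```. The ```scores``` DataFrame, its columns, and its values are made up for illustration.

```python
import pandas as pd

# hypothetical scores for three games
scores = pd.DataFrame({"team": ["Alabama", "Alabama", "Ohio State"],
                       "points": [55, 20, 60]})

is_alabama = scores["team"] == "Alabama"
high_score = scores["points"] > 50

both = scores[is_alabama & high_score]      # bitwise and
either = scores[is_alabama | high_score]    # bitwise or
not_alabama = scores[~is_alabama]           # bitwise not

# Python's logical "and" cannot combine two whole columns of True/False values
try:
    scores[is_alabama and high_score]
    and_failed = False
except ValueError:
    and_failed = True

print(len(both), len(either), len(not_alabama), and_failed)
```

Storing each condition in its own variable, as done here, is one way to keep long selections readable.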

<div class="alert alert-info"> 
    
## Exercise 3: Query without using .query()

Without using ```.query()``` repeat the query from exercise 2 (find rows where the "winner_tie" column contains "State", the "bowl_name" contains "Rose", and the "attendance" column is greater than 75,000).

</div>

In [None]:
# add your code here


# VI. DataFrame Manipulation

## Summary Functions

Pandas offers a handful of summary functions that can be applied to a column or columns of a DataFrame (Series objects). These functions are:

- ```.sum()``` Sum values of each object. 
- ```.count()``` Count non-NA values of each object. 
- ```.median()``` Median value of each object. 
- ```.quantile([0.25,0.75])``` Quantiles of each object. 
- ```.min()``` Minimum value in each object. 
- ```.max()``` Maximum value in each object. 
- ```.mean()``` Mean value of each object. 
- ```.var()``` Variance of each object. 
- ```.std()``` Standard deviation of each object.

We won't cover all of these, but let's try a few.

In [None]:
# the mean of winner_points
print("Winners had an average of ", bowlData['winner_points'].mean())

# the median of loser_points
print("Losers had a median of ", bowlData['loser_points'].median())

# highest score of a losing team
print(f"The highest score of a losing team is {bowlData['loser_points'].max()}")

Wow, 61 points and still a loss... ooof.

<div class="alert alert-info"> 
    
### Exercise 4: Find the standard deviation of a column

Create a print statement similar to those above to print the standard deviation of the "winner_points" column and the "loser_points" column.
</div>

In [None]:
# add your code here


## Sorting Data

Let's begin with a simple sort on a numeric data column with Pandas ```.sort_values()```. This function sorts in ascending order by default (```ascending=True```). To sort in descending order, we provide the parameter ```ascending=False```.

First, we'll double check that the "year" column of the DataFrame is numeric (we'll get into why we're doing this in a minute). Then we'll sort by "year" ascending and return only the "year" and "winner_points" columns.

In [None]:
# see if year column is numeric
bowlData.year.dtype

Excellent, we should be good to proceed.

In [None]:
# sort specific columns by year
bowlData[['year', 'winner_points']].sort_values(by = ['year'], ascending = True)

Now, let's do a more complex sort. Sort by "winner_points" descending to find the highest score by an upset winner (winner ranked lower than loser).

We'll use the columns "winner_rank" and "loser_rank" to accomplish this sort (which we have already converted to numeric data) as well as the "winner_points" column.

Let's double check the data type of the "winner_points" column before we sort.

In [None]:
# see if winner_points column is numeric
bowlData.winner_points.dtype

Great, another numeric data column.

In [None]:
# sort by winner_points to find the highest score by an upset winner
bowlData[bowlData.winner_rank > bowlData.loser_rank].sort_values(by = ['winner_points'], ascending = False)

Notice that our code returned a lot of rows but we'll find the answer to our question in the first row in the "winner_points" column: 70.

How would we do the same sort but return only the highest score by an upset winner as opposed to all the data rows that were returned in the code above?

Since we are sorting with ```ascending = False```, the highest "winner_points" is located in the first row of returned data. In this case, the index of the first row is 727. We don't know ahead of time what the index value of the first row of results will be, though. Don't worry! We can grab the first row of the sorted results using the integer row position 0 with ```.iloc[]``` as opposed to using the row index with ```.loc[]```.

From the results above we expect the output of the following code to be 70.

In [None]:
# sort by the winner_points to find the highest score by an upset winner
bowlData[bowlData.winner_rank > bowlData.loser_rank].sort_values(by = ['winner_points'], ascending = False).iloc[0].loc['winner_points']

Wow! What just happened? We used ```.iloc[0]``` to select only the first row of the results returned by sorting and ```.loc['winner_points']``` to return the value of the "winner_points" column from the results returned by ```.iloc[0]```.
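
If the one-line version is hard to follow, the same chain can be written one step at a time. The sketch below does this on a made-up four-row ```games``` DataFrame (the values are invented for illustration, not taken from the bowl data).

```python
import pandas as pd

# hypothetical games: rows 0 and 2 are upsets (winner_rank > loser_rank)
games = pd.DataFrame({"winner_rank": [20, 3, 15, 8],
                      "loser_rank": [10, 7, 5, 13],
                      "winner_points": [24, 45, 38, 31]})

# step 1: keep only the upsets
upsets = games[games["winner_rank"] > games["loser_rank"]]

# step 2: sort the upsets so the highest score comes first
sorted_upsets = upsets.sort_values(by=["winner_points"], ascending=False)

# step 3: take the first row by position, whatever its index label is
top_row = sorted_upsets.iloc[0]

# step 4: pull the winner_points value out of that row
top_points = top_row.loc["winner_points"]

print(sorted_upsets.index[0], top_points)
```

Here the first row of the sorted result has index label 2, not 0, which is why the position-based ```.iloc[0]``` is the right tool for step 3.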

Are you beginning to see the power of Pandas? We can string together many functions in a row to achieve what we're looking for.

Does sorting work on strings? 

Yes, but we have to be careful! If our strings only contain letters, ```.sort_values()``` will sort from A to Z or Z to A as we'd expect. 

But, if our strings contain numbers or letters and numbers together, ```.sort_values()``` may not return the sort order that we're expecting. This is because Pandas sorts strings character by character using *lexicographic order*. This is also why we've been double checking that our columns with numbers are numeric and not object data type. 

Before we move on, let's work through a quick example of what happens when Pandas sorts strings that contain numbers in lexicographic order.

In [None]:
# create a DataFrame with a column of data that looks like numbers, but are actually strings
df = pd.DataFrame({'ranking': ['1', '3', '5', '2', '4', '10']})
print(df.ranking.dtype)
df

The "ranking" column that looks like numeric data is actually strings. What happens when we sort ascending?

In [None]:
df.sort_values(by = ['ranking'], ascending = True)

This is just something to be aware of. If you want to keep your numerical-looking data as strings but sort in numerical order, there would be some extra steps to execute. We won't cover that here but if you're interested you can find plenty of information on how to accomplish that on the web. For example, [this user question and answer on stackoverflow.com](https://stackoverflow.com/questions/37693600/how-to-sort-dataframe-based-on-particular-stringcolumns-using-python-pandas). That being said, it's probably a good idea to ensure any numerical-looking data columns are assigned numerical data types in order to avoid any unexpected issues caused by string numbers. 
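
As one possible approach (a sketch, not the only way), the ```key``` parameter of ```.sort_values()``` lets us sort by a converted copy of the column while leaving the stored strings unchanged. This assumes Pandas 1.1 or newer, where ```key``` is available.

```python
import pandas as pd

# the same string "ranking" column as in the example above
df = pd.DataFrame({"ranking": ["1", "3", "5", "2", "4", "10"]})

# default sort on strings: lexicographic, so "10" lands right after "1"
lexicographic = df.sort_values(by="ranking")["ranking"].tolist()

# key converts the values to integers for ordering only; the column stays strings
numeric_order = df.sort_values(
    by="ranking", key=lambda col: col.astype(int)
)["ranking"].tolist()

print(lexicographic)  # ['1', '10', '2', '3', '4', '5']
print(numeric_order)  # ['1', '2', '3', '4', '5', '10']
```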

## Grouping Data

```.groupby()``` will partition data into groups, which we can then operate on using functions like ```.mean()```, ```.sum()```, etc. Let's start with a super simple example before we try to use ```.groupby()``` on our ```bowlData```. Pretend we have been observing two falcons and two parrots and are keeping data records on their maximum observed flight speed in miles per hour.

In [None]:
# a new very simple DataFrame 
birds = pd.DataFrame({'species' : ['falcon', 'falcon', 'parrot', 'parrot'],
                   'individual' : ['f01', 'f02', 'squawky', 'pretty boy'],
                   'age_class' : ['adult', 'adult', 'juvenile', 'adult'],
                   'max_speed_mph' : [230., 240., 35., 40.]})
birds

We can use ```.groupby()``` to find the average max speed of each species. We'll first group the DataFrame rows by the "species" column. Then, for each group (there will be two groups: falcon rows and parrot rows) take the mean of the "max_speed_mph" column.

In [None]:
birds.groupby("species")["max_speed_mph"].mean()

Nice! The average max speed of all the falcons is 235 mph and the average max speed of all the parrots is 37.5 mph.

Pandas ```.groupby()``` will work similarly on our much larger ```bowlData``` DataFrame.

Let's group by "winner_tie" (which is the name of the winning team) and then find the average number of points the winning team scored ("winner_points").

In [None]:
avgPointsByWinners = bowlData.groupby("winner_tie")["winner_points"].mean()
avgPointsByWinners

With the ```bowlData``` DataFrame, it is easier to see the default behavior of ```.groupby()``` with regard to how it sorts the result. ```.groupby()``` sorts ascending on the grouping column (here, "winner_tie") when returning the result, which is why our result is sorted by "winner_tie" from A to Z. This is the default behavior unless we specify ```sort=False``` as a parameter in ```.groupby()```.
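
To see the effect of ```sort=False``` on something small, here is a minimal sketch using hypothetical toy data (not the ```birds``` or ```bowlData``` DataFrames) where the parrot rows come before the falcon rows.

```python
import pandas as pd

# hypothetical toy data: parrots appear before falcons in the rows
toy = pd.DataFrame({'species': ['parrot', 'falcon', 'parrot', 'falcon'],
                    'max_speed_mph': [35., 230., 40., 240.]})

# default (sort=True): groups come back in sorted key order
sorted_groups = toy.groupby('species')['max_speed_mph'].mean()

# sort=False: groups come back in order of first appearance
unsorted_groups = toy.groupby('species', sort=False)['max_speed_mph'].mean()

print(sorted_groups)
print(unsorted_groups)
```

The group means are the same either way. Only the order of the rows in the result changes.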

<div class="alert alert-info"> 

### Exercise 5: Using .groupby() 

Find the maximum observed flight speed for each species in the birds DataFrame.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
This next one is challenging and uses a function we haven't covered yet. See if you can work it out!

For each species, find the name of the fastest individual. 

Hint: you will need to use ```.loc[]```, ```.groupby()```, and also a function called ```.idxmax()``` (in that order). See if you can figure it out with some help from the web. The doc page for [.idxmax()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html) and [this user question and answer(s) from stackoverflow.com](https://stackoverflow.com/questions/39964558/pandas-max-value-index) may get you part of the way to the answer.
</div>

In [None]:
# add your code here


## Renaming Columns

To change column names, the easiest way is this:

In [None]:
# create a dictionary where key is the old name and value is the new name
columnMap = {"mvp" : "most_valuable_player", "winner_tie" : "winner"}

# then use the .rename() function
bowlData = bowlData.rename(columns = columnMap, errors = "raise")

bowlData

## Handling Missing Values

Sometimes we need to assess how much of our data is missing to make sure any statistics we apply to the data are robust. 

Sometimes having the missing data value (NaN) in our data is beneficial. Many functions will simply ignore missing data or propagate missing data values. For example, if we were to add one data value to another and the values were NaN + 5, the result would be NaN. This is often desirable.
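
Here is a small sketch of both behaviors, propagating and ignoring, using a hypothetical three-value Series. It assumes NumPy is available alongside Pandas, which is the case in a standard Anaconda installation.

```python
import numpy as np
import pandas as pd

# NaN propagates through ordinary arithmetic
print(np.nan + 5)

values = pd.Series([1.0, np.nan, 3.0])

# element-wise addition keeps the NaN in place
shifted = values + 5

# summary functions skip NaN by default (skipna=True)
total = values.sum()
mean = values.mean()

# with skipna=False the NaN propagates into the result instead
strict_total = values.sum(skipna=False)

print(shifted.tolist(), total, mean, strict_total)
```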

But in some cases, like for various machine learning techniques, we may need to ensure that there are no missing data values present in our data at all. In this case we may need to fill the missing data values with a different number or drop rows with missing data from our DataFrame entirely.

Let's look at the Pandas functions that can help us with missing data.

In [None]:
# show how many missing values are present in each column
bowlData.isna().sum()

In [None]:
# look at all the rows where winner_rank is missing
bowlData[bowlData['winner_rank'].isna()]

In [None]:
# replace all missing values with 0 
# notice we are making a copy of our data by saving the DataFrame to a new variable here
betterBowlData = bowlData.fillna(0)

# let's see if any missing values are still present - shouldn't be
betterBowlData.isna().sum()

Let's pause for a minute to think about what we just did. We filled all NaN with 0. Does that make sense? Does it make sense to have "winner_points" = 0 or "sponsor" = 0? We can see how filling NaN with another value may end up being confusing later. So think carefully if you really need to replace missing data values, or if filling with another value like 0 will work for your analysis. 
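
One alternative to a single blanket fill value is to pass ```.fillna()``` a dictionary that maps column names to fill values, so each column gets a value that makes sense for it and other columns are left alone. The sketch below uses hypothetical toy data and a hypothetical "attendance" column, not rows from ```bowlData```.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; these are not rows from bowlData
games = pd.DataFrame({'sponsor': ['Acme', None, 'Globex'],
                      'attendance': [50000.0, np.nan, 42000.0]})

# the dictionary names only the columns we want to fill;
# 'attendance' is not listed, so its NaN is left untouched
filled = games.fillna({'sponsor': 'unknown'})

print(filled)
```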

Instead of filling NaN with a different value, we may need to drop full data rows if there are any missing values present. This is how we can do that:

In [None]:
# drop all rows where at least 1 column of data is NaN
lessBowlData=bowlData.dropna(how = 'any')
lessBowlData

When we first loaded ```bowlData```, we originally had 1527 rows of data and now after dropping all rows that have at least one missing value, we are left with only 269 rows of data.

If the goal was to drop all rows where all columns contain the missing data value, we could use ```.dropna(how='all')```.
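
The difference between ```how='any'``` and ```how='all'``` is easiest to see on a tiny, hypothetical DataFrame. The sketch also shows the ```subset``` parameter, which restricts the check to the columns we name.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                    'b': [2.0, 5.0, np.nan]})

drop_any = toy.dropna(how='any')        # keeps only row 0
drop_all = toy.dropna(how='all')        # drops only row 2
drop_subset = toy.dropna(subset=['b'])  # checks column 'b' only, drops row 2

print(drop_any.index.tolist(), drop_all.index.tolist(), drop_subset.index.tolist())
```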

## Creating New Columns Derived from Existing Columns

Pandas allows us to easily use existing columns to calculate new data and save the calculations into a new column in the DataFrame. 

Here's an example. Let's define a blowout as when the winning team beats the losing team by 21 or more points. The task now is to create a new column in our DataFrame that indicates whether each game (row) in our dataset was a blowout. We'll fill the column with values of True or False.

In [None]:
bowlData['blowout'] = bowlData.winner_points - bowlData.loser_points >= 21
bowlData

It's that simple! There is no looping required. Pandas is smart enough to do the subtraction row by row and fill the result in the appropriate place all on its own. Writing a loop would be much slower, which is often the case with Python. It's most efficient to use the built-in capabilities of whatever packages you're working with. Try to avoid looping wherever you can.
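
For comparison, here is a sketch of the column-wise calculation next to an explicit loop that produces the same answers. The scores are hypothetical and are not rows from ```bowlData```.

```python
import pandas as pd

# hypothetical scores, not rows from bowlData
scores = pd.DataFrame({'winner_points': [42, 24, 35],
                       'loser_points': [14, 21, 14]})

# vectorized: one expression operates on whole columns at once
scores['blowout'] = scores.winner_points - scores.loser_points >= 21

# equivalent explicit loop, shown only for comparison
blowout_loop = []
for w, l in zip(scores.winner_points, scores.loser_points):
    blowout_loop.append(w - l >= 21)

print(scores['blowout'].tolist())
print(blowout_loop)
```

Both versions agree, but the vectorized line is shorter and lets Pandas do the work in optimized code.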

<div class="alert alert-info"> 
    
### Exercise 6: Copy an existing column to a new column

Modify the DataFrame ```bowlData``` by copying the "year" column to a new column called "year_copy". Print ```bowlData``` to the screen to check your work.
</div>

In [None]:
# add your code here


## Merging DataFrames Together

Let's pretend we have additional college football bowl data hanging out in a separate CSV file. The additional data has the same 'id' information (row index value) as in data/collegefootballbowl.csv, but not all id's (1-1527) are present. How do we join this additional data to the ```bowlData``` DataFrame?

In [None]:
# load the data
moreBowlData = pd.read_csv('data/morecollegefootballbowldata.csv')
moreBowlData

Oh wow, that's not much data! But let's merge it into ```bowlData``` anyway. The merge column will be the "id" column since that is the only common column between the two data files.

In [None]:
allBowlData = pd.merge(bowlData, moreBowlData, how = 'outer', on = 'id')
allBowlData

We can see that for the three id's where we have "tickets_sold" and "best_selling_concession" data, those data values now appear in the merged DataFrame. And everywhere else in those two columns got filled with NaN. There are tons of ways to merge DataFrames. Check out the [```pd.merge()``` API reference](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) for more information.
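
To get a feel for how the ```how``` parameter changes the result, here is a sketch with two tiny, hypothetical DataFrames that share only some "id" values.

```python
import pandas as pd

# hypothetical toy data: id 2 is in both, ids 1 and 3 only on the left,
# id 4 only on the right
left = pd.DataFrame({'id': [1, 2, 3], 'bowl': ['Rose', 'Sugar', 'Orange']})
right = pd.DataFrame({'id': [2, 4], 'tickets_sold': [70000, 55000]})

# outer: keep every id found in either DataFrame
outer = pd.merge(left, right, how='outer', on='id')

# inner: keep only ids found in both DataFrames
inner = pd.merge(left, right, how='inner', on='id')

# left: keep every id from the left DataFrame, fill gaps with NaN
left_join = pd.merge(left, right, how='left', on='id')

print(outer)
print(inner)
print(left_join)
```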

# VII. Writing a DataFrame to a File

Pandas has a function ```.to_csv()``` for writing a DataFrame to a .csv file.

Let's write our bowl data to a file.

In [None]:
# index = False leaves the row index out of the file; header = True writes the column names
bowlData.to_csv('data/bowlData.csv', index = False, header = True)

Locate the file you just wrote, open it, and see what it looks like. If you open the file with a double click in JupyterLab it will look like a spreadsheet. This is Jupyter's CSV viewer. If you right click the file in JupyterLab, choose Open With, then choose Editor to open the file, you should see the comma separated header and data values.
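
If you'd like to see the raw text that ```.to_csv()``` produces without leaving the notebook, one option is to write to an in-memory text buffer from Python's built-in ```io``` module instead of a file. This is only a sketch with a hypothetical two-row DataFrame.

```python
import io
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'team': ['A', 'B'], 'points': [21, 14]})

# write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# read the same text back into a DataFrame
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```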

# VIII. Pandas Datetimes

*Datetimes* are a special type of object that represent a date and a time. Python has a built-in datetime data type that we did not previously cover because Pandas' datetimes offer much more functionality. In this section we'll cover the Pandas datetime objects *Timestamp* and *DatetimeIndex*.

## Creating Datetimes with Pandas

The string format of a datetime looks like ```YYYY-MM-DD hh:mm:ss.ns```. The date part of the object in years, months, and days comes before the space. The time part comes after the space in hours, minutes, seconds, and seconds fraction. The highest precision of a Pandas datetime object is nanoseconds but not all datetimes need to be that precise. 

Pandas stores single datetimes as Timestamp objects and sequences of datetimes as DatetimeIndex objects. We can create a Timestamp object either with ```pd.Timestamp()``` or ```pd.to_datetime()```. We can create a DatetimeIndex object either with ```pd.to_datetime()``` or ```pd.date_range()```.

Let's create our first datetimes. The Pandas functions for creating Timestamps can accept date inputs in a range of different formats. 

In [None]:
# convert an individual date string into a Pandas Timestamp object
# all of these will result in the same Timestamp

print('pd.to_datetime()')
print(pd.to_datetime('2025-01-01')) 
print(pd.to_datetime('2025/01/01')) 
print(pd.to_datetime('1/1/2025')) 
print(pd.to_datetime('2025.01.01')) 
print(pd.to_datetime('Jan 1, 2025')) 
print(pd.to_datetime('20250101')) 

print('pd.Timestamp()')
print(pd.Timestamp('2025-01-01')) 
print(pd.Timestamp('2025/01/01')) 
print(pd.Timestamp('1/1/2025')) 
print(pd.Timestamp('2025.01.01')) 
print(pd.Timestamp('Jan 1, 2025')) 
print(pd.Timestamp('20250101')) 
print(pd.Timestamp(2025,1,1)) 

There are some differences between these two functions in what types of date inputs can be accepted though. See the Pandas API Reference for [```pd.Timestamp()```](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html) and [```pd.to_datetime()```](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) for details.
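
A few of those differences are sketched here: ```pd.to_datetime()``` accepts a ```format``` parameter for ambiguous strings and an ```errors``` parameter for unparseable input, while ```pd.Timestamp()``` accepts the year, month, and day as integers. The example strings are made up for illustration.

```python
import pandas as pd

# format tells Pandas how to read an ambiguous string: here day/month/year
day_first = pd.to_datetime('01/02/2025', format='%d/%m/%Y')

# without a format, this string is read as month/day/year
month_first = pd.to_datetime('01/02/2025')

# errors='coerce' returns NaT instead of raising an error on unparseable input
bad = pd.to_datetime('not a date', errors='coerce')

# pd.Timestamp() also accepts year, month, day as integers
from_ints = pd.Timestamp(2025, 2, 1)

print(day_first, month_first, bad, from_ints)
```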

Notice how Timestamp objects always display as ```YYYY-MM-DD hh:mm:ss```. If we don't provide hours, minutes, and seconds, they default to zero. Also, if we leave off the month or day, Pandas will default to the first month and first day.

In [None]:
# Pandas will fill out the rest of the Timestamp if we don't provide it in the input string
print(pd.Timestamp('2025-01')) 
print(pd.Timestamp('2025')) 

Inputting a sequence of dates into ```pd.to_datetime()``` will return a DatetimeIndex object containing data of type datetime64. The [ns] next to the data type below indicates the precision of the datetime is nanoseconds.

In [None]:
dates = ['2025-01-15','2025-03-12','2025-10-02','2025-02-28','2025-12-20']
pd.to_datetime(dates)

We can create a sequence of datetimes from a starting point to an ending point with the ```pd.date_range()``` function. The frequency parameter lets us easily create sequences of datetimes spaced at intervals. 

Below we create DatetimeIndex objects with daily, monthly, and 6-hourly frequency. All the available frequency options are listed in the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases). 

In [None]:
# create daily datetimes (D = daily)
print(pd.date_range('2025-01-01','2025-01-31',freq='D')) 

# create monthly datetimes at the start of each month (MS = month start)
print(pd.date_range('2025-01-01','2025-12-31',freq='MS'))

# create hourly datetimes every 6 hours (6h = 6-hourly)
print(pd.date_range('2025-01-01 00', '2025-01-01 18',freq='6h'))

As we see above with 6 hourly frequency, an integer can be added into the frequency string to create datetimes spaced at multiples of hours (or minutes, days, months, etc).

## Datetime Properties

Datetimes have components and properties that we can access as attributes directly on the datetime object. The full list of datetime properties is located in the [Pandas User Guide section on timeseries](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components).  

In [None]:
# a Timestamp object for demonstration
one_timestamp = pd.to_datetime('2025-05-01')
print(one_timestamp)

# a DatetimeIndex object for demonstration
datetime_index = pd.date_range('2025-01-01','2025-12-31',freq='MS')
print(datetime_index)

In [None]:
# accessing properties of a Timestamp object
print(one_timestamp)
print('-------------------')
print('year', one_timestamp.year)
print('month', one_timestamp.month)
print('day', one_timestamp.day)
print('hour', one_timestamp.hour)
print('day of year', one_timestamp.dayofyear)
print('day of week', one_timestamp.dayofweek)
print('quarter', one_timestamp.quarter)

In [None]:
# accessing properties of a DatetimeIndex object
print(datetime_index)
print('--------------------------------------------------------------------------------')
print('year', datetime_index.year)
print('month', datetime_index.month)
print('day', datetime_index.day)
print('hour', datetime_index.hour)
print('day of year', datetime_index.dayofyear)
print('day of week', datetime_index.dayofweek)
print('quarter', datetime_index.quarter)

Timestamp and DatetimeIndex objects can also exist inside of Pandas Series and DataFrame structures. A DatetimeIndex could be a column of our DataFrame (Series) or we could use it to time-index our DataFrame (which we'll cover in the next subsection). If our datetimes are a Series in a DataFrame, we can access the object properties through the ```.dt``` accessor.

In [None]:
# make DatetimeIndex a Series in a DataFrame for demonstration
df = pd.DataFrame(datetime_index,columns=['DATE'])
df.head()

In [None]:
# access components of datetimes when they are stored in a Series
# a Series is returned
df.DATE.dt.year

## Timedeltas for Math with Dates

*Timedelta* objects (data type timedelta64) are differences in datetimes, expressed in difference units, e.g. days, hours, minutes, seconds. They can be both positive and negative.

We can use timedeltas to add or subtract a fixed amount of time from a datetime.

In [None]:
# create timedelta object
delta = pd.Timedelta('6h')
delta

In [None]:
# add timedelta to datetimes
datetime_index + delta

In [None]:
# subtract timedelta from datetimes
datetime_index - delta

Notice how easy it was to subtract 6 hours from these datetimes. If we kept our dates as strings instead of datetimes, this task would require us to write a significant amount of code. Datetimes and timedeltas allow us to use simple addition and subtraction instead of having to code up something much more complicated.
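
Timedeltas can also be built from keyword arguments, and subtracting one Timestamp from another gives a Timedelta back. The sketch below uses hypothetical variable names (```step```, ```start```, ```end```) so it doesn't overwrite anything defined earlier in the notebook.

```python
import pandas as pd

# build a timedelta from keyword arguments
step = pd.Timedelta(days=1, hours=6)

start = pd.Timestamp('2025-02-28 12:00')
end = start + step   # crosses the month boundary into March

# subtracting two Timestamps gives a Timedelta back
elapsed = end - start

# subtracting in the other order gives a negative Timedelta
negative = start - end

print(end)
print(elapsed, elapsed.total_seconds())
print(negative)
```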

## Working with Time-Indexed Data

One of the most powerful applications of Pandas datetimes is for time-indexed data. This means using the DatetimeIndex object as the index in a DataFrame. This would be appropriate for data that occur over time where time has significant meaning to the data values, like daily observations of tornado occurrences, for example, or any other timeseries of data.

For this section we'll work with daily severe weather counts (tornados, severe wind, and severe hail) for the state of Mississippi. This data was obtained for the years 2004-2023 from the [NOAA National Weather Service Storm Prediction Center website](https://www.spc.noaa.gov/climo/summary/) and compiled into a single data file ```data/NOAA_SevereWeather/NOAA-SPC_SevereWxCounts_MS_2004-2023.csv``` for our use here.

Some of the things we can do with time-indexed data like the severe weather counts are:
- resampling in time, e.g., daily counts --> annual counts
- grouping in time, e.g., find the long-term average number of tornados that occur in each month
- math with dates, e.g., find the number of days between tornado occurrences

First, we'll load the data into a DataFrame. We can tell Pandas to create datetimes through the use of parameters in the ```pd.read_csv()``` function. To use the "Date" column of data as the index of the DataFrame we can use the parameter ```index_col='Date'```. To turn the dates into datetimes (a DatetimeIndex) we can use the parameter ```parse_dates=['Date']```.

In [None]:
# read data into DataFrame, converting dates to datetimes and using them as the df index
wx_df = pd.read_csv('data/NOAA_SevereWeather/NOAA-SPC_SevereWxCounts_MS_2004-2023.csv', 
                 usecols=['Date','Tornado','Wind','Hail'], 
                 parse_dates = ['Date'],
                 index_col='Date')
wx_df

There are 20 years of data: 20 * 365 days + 5 leap days = 7305 rows of data.

All the data columns are counts of severe weather occurrences, so if there are no messy data values then Pandas should have assigned those columns an integer data type. Let's double check the type of the data columns and index.

In [None]:
# look at data types
print(wx_df.dtypes)
type(wx_df.index)

We now have a time-indexed DataFrame of daily counts of severe weather occurrences in the state of Mississippi from 2004-2023. 

### Resampling in Time

Because the index of our DataFrame is datetimes, we can easily resample the data in time, e.g., daily counts --> annual counts.

Pandas DataFrame ```.resample()``` works similarly to ```.groupby()```. ```.groupby()``` works on a column of data given a certain condition, whereas ```.resample()``` works on a DatetimeIndex given a certain time alias. Here, we use the alias ```'A'``` for annual. Note that Pandas 2.2 and later deprecate ```'A'``` in favor of ```'YE'``` (year end), so you may see a warning depending on your installed version. The list of resampling alias options can be found in the [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#period-aliases).

In [None]:
# resample daily to annual counts
df_annual = wx_df.resample('A').sum()
df_annual

Notice that just like with ```.groupby()``` we need to also apply some sort of mathematical function like ```.sum()```.
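
To make the comparison with ```.groupby()``` concrete, here is a sketch on four hypothetical daily counts that straddle a year boundary. It uses the ```'YS'``` (year start) alias purely for illustration, which labels each annual bin with January 1 of that year.

```python
import pandas as pd

# hypothetical daily counts that straddle a year boundary
days = pd.date_range('2024-12-30', '2025-01-02', freq='D')
toy = pd.DataFrame({'count': [1, 2, 3, 4]}, index=days)

# resample into annual bins labeled by the start of each year
by_resample = toy.resample('YS').sum()

# grouping on the year component gives the same totals,
# labeled by integer year instead of by datetime
by_groupby = toy.groupby(toy.index.year).sum()

print(by_resample)
print(by_groupby)
```

The totals match. The main difference is the index of the result: ```.resample()``` keeps a DatetimeIndex, while grouping on ```.year``` gives plain integers.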

<div class="alert alert-info"> 

### Exercise 7: Resample data using datetimes

Resample ```wx_df``` to obtain a DataFrame where each row contains the sum of one month of each type of severe weather observation (your result should retain the columns "Tornado", "Wind", and "Hail").
</div>

In [None]:
# add your code here

wx_df.resample('M').sum()

### Grouping in Time

We can also use datetimes to easily group in time to calculate things like the long-term average number of occurrences of each type of severe weather per month of the year.

First, we'll programmatically get the number of data years in the dataset by accessing the ```.year``` property of the DatetimeIndex and then use the Pandas function ```.nunique()``` to count the unique years in the data (total number of years).

In [None]:
# programmatically determine number of years in the dataset
nyears = wx_df.index.year.nunique()
nyears

Now, we can group the entire dataset by month, sum the values in each group, and divide by the total number of years in the dataset to get the average number of occurrences of each type of severe weather per month of the year. 

In [None]:
# calculate long term monthly means
df_monthly_mean = wx_df.groupby(wx_df.index.month).sum() / nyears
df_monthly_mean
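
The same calculation is easier to check by hand on a tiny example. This sketch uses four hypothetical observation dates spread over two years. Notice that January rows from both years fall into the same group.

```python
import pandas as pd

# hypothetical counts on four days spread over two years
toy_dates = pd.to_datetime(['2024-01-10', '2024-02-05', '2025-01-20', '2025-01-25'])
toy = pd.DataFrame({'Tornado': [2, 1, 4, 0]}, index=toy_dates)

# number of distinct years in the toy data
n_toy_years = toy.index.year.nunique()

# January rows from both years land in the same group (month 1)
toy_monthly_mean = toy.groupby(toy.index.month).sum() / n_toy_years

print(n_toy_years)
print(toy_monthly_mean)
```

January has 2 + 4 + 0 = 6 tornados over 2 years, so its long-term mean is 3.0. February has 1 tornado over 2 years, giving 0.5.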

### Differencing Dates

Another very useful aspect of datetimes is how we can difference them. When we difference two datetimes, the result is a timedelta. Let's look at an example. We'll difference consecutive dates of tornado occurrences to get the number of days between tornados. First, let's subset our daily severe weather data to only the tornado data and drop all records when there were no tornados.

In [None]:
# new DataFrame with only the index and Tornado column
# double square bracket means return a DataFrame instead of a Series
tornados = wx_df[['Tornado']]

# drop all rows with zero tornados
tornados = tornados.query('Tornado != 0')

tornados

Now we can use the ```.diff()``` function on the DataFrame DatetimeIndex. 

In [None]:
tornados['timedelta_since_tornados'] = tornados.index.diff()
print(tornados.dtypes)
tornados

Pandas executes ```.diff()``` on our DataFrame index as ```diff[i] = index[i] - index[i-1]``` (subtracting the previous index value). That's why the first result is the missing value NaT which stands for "Not a Time", the datetime equivalent of NaN.

Notice that the "timedelta_since_tornados" column has data type timedelta64. To convert timedelta objects that are in a Series to a numerical data type we can use the ```.dt``` accessor. ```.dt.days``` will pull out the day component of the timedelta objects into a numerical data type. The list of attributes you can access from timedeltas can be found in the [Pandas API Reference for ```pd.Timedelta()```](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html). 

In [None]:
tornados['days_since_tornados'] = tornados['timedelta_since_tornados'].dt.days
print(tornados.dtypes)
tornados
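
Our tornado dates are whole days apart, so ```.dt.days``` tells the whole story here. When timedeltas include hours or minutes, ```.dt.days``` returns only the day component, while ```.dt.total_seconds()``` returns the full duration. The sketch below uses two hypothetical timedeltas created with ```pd.to_timedelta()```, a Pandas function not otherwise used in this notebook.

```python
import pandas as pd

# hypothetical gaps: one and a half days, and exactly three days
gaps = pd.Series(pd.to_timedelta(['1 days 12:00:00', '3 days 00:00:00']))

# day component only: the 12 hours are not counted
whole_days = gaps.dt.days

# full duration converted from seconds to days (86400 seconds per day)
fractional_days = gaps.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.tolist())
```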

<div class="alert alert-info"> 
    
# IX. Exercise: Putting it All Together

Use Pandas to read, clean, manipulate, and aggregate weather observations.

## Read the data file
Read the file at ```data/weatherdata.csv``` into a Pandas DataFrame and render the DataFrame to the screen. Don't specify any parameters besides the filename when reading the file into a DataFrame.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 

## Clean the data

Look at all the columns of data. Do you see any mistakes?

Replace 'LosAngeles' at index 3 with 'Los Angeles'.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
Show the data type of each column. 
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
What data type is the "windspeed_knots" column? 
</div>

Type your answer:

<div class="alert alert-info">
Why did Pandas assign that data type to "windspeed_knots"?
</div>

Type your answer:

<div class="alert alert-info">
Judging by the data values in the "windspeed_knots" column, what data type should "windspeed_knots" probably be?
</div>

Type your answer:

<div class="alert alert-info">
Force the "windspeed_knots" column to be numeric, then show that the data type of the column did in fact change.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
## Create a new column of data

Create a column of boolean data called "IsRainy" that indicates whether there was precipitation.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 

## Calculate average temperature by city

Calculate the average temperature for each city and save the result as a new variable called ```avgT```.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
Looking at the ```df``` DataFrame and the ```avgT``` results, how many data values were used to calculate the average New York temperature? 
</div>

Type your answer:

<div class="alert alert-info"> 

## Convert date strings to datetimes

Convert the string values in the "dates" column to a DatetimeIndex. You don't need to reset the DataFrame's index to the "dates" column though, just convert the string data in the "dates" column to datetimes.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
## Sort the DataFrame by date, ascending

</div>

In [None]:
# add your code here


<div class="alert alert-danger">

**Sidebar about sorting:** If you have dates in your data, it's best to convert them from string values to datetime objects. Remember how we saw earlier what can happen when sorting strings that contain numbers? If we had full months of date strings, sorting would be problematic due to the lexicographic sort order. This problem is avoided completely if you work with dates as datetimes instead of strings. 
</div>

<div class="alert alert-info"> 
    
## Write the DataFrame to file

Write the ```df``` data to a file called ```weatherdata_yourname.csv```, replacing yourname with your first name. Do not include the index column, but do include the column names.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 

Does your csv file look like this inside?

<img src="images/weatherdata.png" alt="contents of csv file" width="400"/>

If so, congratulations! You've successfully completed this exercise.
</div>

# X. At a Glance: Language Covered

The Pandas functionality that we covered at a glance...

## Pandas functions

```pd.DataFrame()```, ```pd.read_csv()```, ```pd.to_numeric()```, ```pd.merge()```, ```pd.to_datetime()```, ```pd.Timestamp()```, ```pd.date_range()```, ```pd.Timedelta()```


## Pandas data structure (DataFrame or Series) methods 

```.head()```, ```.tail()```, ```.describe()```, ```.info()```, ```.unique()```, ```.query()```, ```.mean()```, ```.median()```, ```.max()```, ```.std()```, ```.sum()```, ```.sort_values()```, ```.groupby()```, ```.idxmax()```, ```.rename()```, ```.isna()```, ```.fillna()```, ```.dropna()```, ```.to_csv()```, ```.resample()```, ```.diff()```, ```.nunique()```

## Pandas data structure (DataFrame or Series) attributes and accessors

```.shape```, ```.dtypes```, ```.loc```, ```.iloc```, ```.dt```


<div class="alert alert-success">

# XI. Learning More About Pandas

For more about Pandas, start on the Pandas website where you can find:

- a nice cheat sheet https://pandas.pydata.org/docs/getting_started/index.html
- a long list of community developed tutorials https://pandas.pydata.org/docs/getting_started/tutorials.html#communitytutorials
- the user guide, which contains a bunch of 10 minute learning guides as well as more in-depth guides by topic https://pandas.pydata.org/docs/user_guide/index.html
</div>