# Introduction to Pandas for Working with Tabular Data

<div class="alert alert-success">
    
## This notebook covers
- Pandas data structures - dataframes and series
- Selecting, slicing, and querying dataframes
- Simple calculations with summary functions
- Sorting and grouping data
- Copying and renaming dataframe columns
- Handling missing values
- Merging dataframes and writing to file
</div>

<div class="alert alert-warning">

## Reminders

Remember, you can use Jupyter's built-in table of contents (hamburger on the far left) to jump from heading to heading.

---

This notebook should run in the Anaconda base environment. We'll discuss more about environments later, but for now look for something like the words "Python3" or "base" at the top right of this notebook. If it says "No Kernel", go to the Kernel tab, select Change Kernel, then select the Python3 or base kernel in the pop up window.

---

To run cells in this notebook place your cursor in the cell you want to run, then hit Shift+Enter.

---

To turn on line numbers for code cells, go to the View menu and click Show Line Numbers.

</div>

# I. Importing Necessary Packages
The following code will load the packages you'll need for this notebook. Packages are collections of code that add functionality to core Python. It's best practice to import everything you need in one place at the top of your notebook or script.

The "as pd" part of the pandas import statement below is giving the Pandas package an alias in our notebook. This way when we want to use functions from the Pandas package we can type, for example, ```pd.DataFrame()``` instead of the longer ```pandas.DataFrame()```. An alias can be anything really, but "pd" is the alias that the Pandas user community has settled on.

This particular package was installed to your computer during the installation of Anaconda, so we can simply import it here instead of having to take the extra step of downloading it first. We'll cover how to download and import additional packages in a subsequent notebook.

In [2]:
import pandas as pd

Packages that extend the core Python language, such as the one we imported above, usually have a website where you can find tons of helpful information. Package websites may include, for example, "Getting Started" tutorials, in-depth user guides, an API reference that documents the particulars of every single available function, and instructions on where to ask the user community questions, submit bug reports, or make software contributions.

**If you need help, package websites are one of the first places you should look. Let's take a quick look at the Pandas website [https://pandas.pydata.org/](https://pandas.pydata.org/)**.

# II. Introduction to Pandas Data Structures

On the Pandas website, the package developers describe the project's goal: pandas "aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language."

Pandas is a powerful tool for working with tabular data such as data stored in spreadsheets, databases, or other table-like formats. The main data structure Pandas uses to hold data is called the *DataFrame*. A DataFrame is a 2-dimensional (rows and columns) structure that can store data of different types including strings, integers, floating point values, categorical data and more. DataFrames are like a spreadsheet - think of a table of data with headings and values. Each column of a Pandas DataFrame is its own data structure called a *Series*. Both data structures (DataFrame and Series) have an *index*. You can think of an index, for now, as a row or line number, but it can be anything, even text.

Below are schematics of what a Pandas DataFrame and Series look like, where the darker grey boxes would hold headers (column names) and indexes (row names/numbers), while the lighter grey boxes would hold the data values.

<table><tr>
<!-- <td> <img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg" alt="schematic of a dataframe" width="700"/> </td>
<td> <img src="https://pandas.pydata.org/docs/_images/01_table_series.svg" alt="schematic of a series" width="200"/> </td> -->
<td> <img src="images/01_table_dataframe.svg" alt="schematic of a dataframe" width="700"/> </td>
<td> <img src="images/01_table_series.svg" alt="schematic of a series" width="200"/> </td>
</tr></table>
         
(Image Source: [Pandas Docs Getting Started Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html))

## Making a DataFrame from Scratch

You can make a dataframe programmatically, as opposed to reading data from a file. For example, let's create one from lists.

In [None]:
listOfTuples = [(1,'089',32223,59730,98468,35297),
                (2,'121',27183,56159,145165,52539),
                (3,'073',27399,50075,57786,21237),
                (4,'033',25065,59734,166234,56641),
                (5,'059',23547,49620,140298,50185),
                (6,'047',23111,44550,194029,69384),
                (7,'149',22079,40404,48773,18941),
                (8,'045',21935,44494,43929,17380),
                (9,'081',21831,39049,82910,32086),
                (10,'131',21691,43728,17786,6165)]

listOfColumnNames = ["rank", "county_fips",
                     "per_capita_income", "median_household_income",
                     "population", "num_households"]

countyData = pd.DataFrame(listOfTuples, columns = listOfColumnNames)
        
# if placed on a line by itself, you will get pretty output of the dataframe    
countyData

Our data is now in a Pandas dataframe, which has 6 named columns of data (6 series) as well as a row index (the bold column without a header).

Isolating a series from a dataframe would look like this:

In [None]:
countyData["population"]

You can see that the rendering of a series isn't pretty like that of a dataframe. But notice how a series isn't just the data values. The index remains attached and the column name is there as well. How do I know this is a Pandas series? We can use Python's built-in function ```type()``` like we did in the previous notebook.

In [None]:
type(countyData["population"])

And for good measure...

In [None]:
type(countyData)
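
As noted earlier, an index does not have to be a row number. The sketch below is a minimal illustration using a small made-up table (the `fruit` dataframe and its values are hypothetical, not part of this notebook's data). It shows text labels used as the index, and that a column pulled out of the dataframe keeps both the index and the column name.

```python
import pandas as pd

# Hypothetical mini-table; the values are made up for illustration only
fruit = pd.DataFrame(
    {"color": ["red", "yellow", "orange"], "count": [3, 12, 5]},
    index=["apple", "banana", "orange"],  # text labels instead of 0, 1, 2
)

# A single column is a Series that keeps the index and the column name
counts = fruit["count"]
print(type(counts).__name__)  # Series
print(counts.name)            # count
print(list(counts.index))     # ['apple', 'banana', 'orange']
```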

## Loading Tabular Data from a File into a DataFrame

Pandas makes it very easy to load an Excel file or other tabular data sources like .csv files into dataframes.

The .csv file extension is a common file format for tabular data. CSV stands for comma separated values. Inside a CSV file, you will see rows of data with commas used between the values of each data column (i.e., "comma delimited"). Generally, we use .csv files instead of Excel files because an Excel worksheet is limited to 1,048,576 rows. (Also, reading an Excel file with Pandas requires installation of an additional package. We'll cover that in the course module "Input/Output of Different Data Formats").

A raw CSV file with a header row might look like this, for example:

<pre>
Book Title,Publisher,Price
War and Peace,Vintage Classics,12.99
"Our Bodies, Ourselves",Touchstone,48.38
Putin's Playbook,"Simon & Schuster, Inc.",14.49
</pre>

Notice that the second book title listed and the third publisher name have a comma in the data value. Those values are surrounded by quotation marks so that Pandas does not interpret the comma as the start of a new column. This is important to do when you are creating your own CSV data.
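
To see the quoting at work, here is a minimal sketch that reads the same example text with ```pd.read_csv()```. It assumes the text is held in a string and wrapped in `io.StringIO` so no file on disk is needed; the variable names `raw_csv` and `books` are only for illustration.

```python
import io
import pandas as pd

# The example CSV text from above, held in a string instead of a file
raw_csv = '''Book Title,Publisher,Price
War and Peace,Vintage Classics,12.99
"Our Bodies, Ourselves",Touchstone,48.38
Putin's Playbook,"Simon & Schuster, Inc.",14.49
'''

books = pd.read_csv(io.StringIO(raw_csv))

# The quoted commas stay inside their values, so we still get 3 columns
print(books.shape)                 # (3, 3)
print(books.loc[1, "Book Title"])  # Our Bodies, Ourselves
print(books.loc[2, "Publisher"])   # Simon & Schuster, Inc.
```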

Let's begin by loading an example data file in .csv format into a Pandas dataframe.

The example data used in this notebook is college football bowl data. While this particular data may not be relevant to your job or research, the data exploration and cleaning techniques we'll work through do broadly apply to any tabular data you may have (in .csv, .xls(x) or even .txt formats).  


In [71]:
bowlData = pd.read_csv('data/collegefootballbowl.csv')
bowlData

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
0,1,2021,12/29/2021,Wed,Oklahoma,14,47,Oregon,15,32,59121.0,"Oklahoma RB Kennedy Brooks, Oklahoma S Pat Fields",Valero,Alamo Bowl
1,2,2020,12/29/2020,Tue,Texas,20,55,Colorado,,23,10822.0,"Texas RB Bijan Robinson, Texas LB DeMarvion Ov...",Valero,Alamo Bowl
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl
3,4,2018,12/28/2018,Fri,Washington State,12,28,Iowa State,25,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
4,5,2017,12/28/2017,Thu,Texas Christian,13,39,Stanford,15,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl


Notice what gets printed to the screen when a dataframe has many rows. We can see the first 5 and last 5 rows of data, followed by the shape (rows, columns) of the full dataframe.

Also notice some columns contain NaN values. NaN stands for "not a number" and usually represents a missing data value of type float. We'll cover more about different missing data values, how to handle them, and the missing data features Pandas offers later.
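
Because NaN is a floating point value, a column of whole numbers with a missing entry is typically stored as floats. This is a small sketch with made-up values (the `ranks` series is hypothetical) that shows the effect and how ```.isna()``` flags the missing entry.

```python
import pandas as pd

# Whole numbers plus one missing value (None)
ranks = pd.Series([14, 20, None, 12])

print(ranks.dtype)            # float64, because NaN is a float
print(ranks.isna().tolist())  # [False, False, True, False]
```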


<div class="alert alert-danger">
    
**Sidebar about data management for tabular data:** It's important to, at minimum, create a data dictionary that describes what is in your data file, even if your data file contains column names. Often there is more information (metadata) that is required by future users of the data than just the column names and data values. Metadata for the collegefootballbowl.csv can be found in the [collegefootballbowl.txt](data/collegefootballbowl.txt) file, which tells future data users the original source of the data, when the data file was created, and contains a data dictionary that describes each column of data. This type of information is super important especially when your data values have units. Even future you may forget if your data is in Celsius or Fahrenheit, or if your precipitation data had units of cm, mm, inches, or hundredths of inches! 
</div>

## Viewing the Head and Tail of a DataFrame

```.head()``` lets us view the first N rows of a dataframe 

```.tail()``` lets us view the last N rows

Using either one of these methods on a dataframe without any additional parameters will show you 5 rows. Enter an integer as a parameter to either function to view a different number of rows.



In [72]:
# view first 10 rows
bowlData.head(10)

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
0,1,2021,12/29/2021,Wed,Oklahoma,14.0,47,Oregon,15.0,32,59121.0,"Oklahoma RB Kennedy Brooks, Oklahoma S Pat Fields",Valero,Alamo Bowl
1,2,2020,12/29/2020,Tue,Texas,20.0,55,Colorado,,23,10822.0,"Texas RB Bijan Robinson, Texas LB DeMarvion Ov...",Valero,Alamo Bowl
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12.0,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
4,5,2017,12/28/2017,Thu,Texas Christian,13.0,39,Stanford,15.0,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
6,7,2015,1/2/2016,Sat,Texas Christian,11.0,47,Oregon,15.0,41,64569.0,"TCU QB Brian Kohlhausen, TCU S Travin Howard",Valero,Alamo Bowl
7,8,2014,1/2/2015,Fri,UCLA,14.0,40,Kansas State,11.0,35,60517.0,"RB Paul Perkins (UCLA), LB Eric Kendricks (UCLA)",Valero,Alamo Bowl
8,9,2013,12/30/2013,Mon,Oregon,10.0,30,Texas,,7,65918.0,"QB Marcus Mariota (Oregon), SS Avery Patterson...",Valero Energy Corporation,Alamo Bowl
9,10,2012,12/29/2012,Sat,Texas,,31,Oregon State,15.0,27,65277.0,"WR Marquise Goodwin (Texas), DL Alex Okafor (T...",Valero Energy Corporation,Alamo Bowl


In [73]:
# view last 10 rows
bowlData.tail(10)

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
1517,1518,2001,12/27/2001,Thu,Georgia Tech,,24,Stanford,11.0,14,30144.0,,Jeep,Seattle Bowl
1518,1519,2000,12/24/2000,Sun,Georgia,24.0,37,Virginia,,14,24187.0,,Jeep,Seattle Bowl
1519,1520,1999,12/25/1999,Sat,Hawaii,,23,Oregon State,,17,40974.0,,Jeep,Seattle Bowl
1520,1521,1998,12/25/1998,Fri,Air Force,16.0,45,Washington,,25,46451.0,,Jeep,Seattle Bowl
1521,1522,1948,12/18/1948,Sat,Hardin-Simmons,,40,Ouachita,,12,,,,Shrine Bowl
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20.0,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl
1526,1527,2000,12/31/2000,Sun,Air Force,,37,Fresno State,,34,26542.0,"QB Mike Thiessen (Air Force), LB Tim Skipper (...",,Silicon Valley Bowl


## Getting DataFrame Information

### Shape Property: How Many Rows / Columns?

```.shape``` returns a tuple (rows, columns) and is the most concise way to see how many total rows and columns are in a dataframe. 

In [74]:
bowlData.shape

(1527, 14)

### Dtypes Property: Understanding the Data Types of Each Column

Often you will need to know the data type of each column. The ```.dtypes``` property will give you that information.

Any column that says "object" is probably storing strings. int64 and float64 are numerical data (numbers).

In [75]:
bowlData.dtypes

id                 int64
year               int64
date              object
day               object
winner_tie        object
winner_rank       object
winner_points      int64
loser_tie         object
loser_rank        object
loser_points       int64
attendance       float64
mvp               object
sponsor           object
bowl_name         object
dtype: object

How were the data types for each column determined? 

When we read the .csv file using ```pd.read_csv()``` Pandas inferred the data type of each column based on the column's data values. If all values in a column appear to be integers, Pandas will infer that the column is data type int. Sometimes, there can be mistakes in your data file though. You could have a data column, for example winner_rank, that should contain all integers or missing values but a data entry mistake in your file has added 1 or more values that are non-numeric in that column. A lot of real data is "messy" like this. In these cases where there are mixed data types in a single column, Pandas usually reads the whole column as strings and assigns a data type of "object". This is in fact the case with our data columns winner_rank and loser_rank, which we will investigate a bit more later.
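
The inference described above can be sketched with two tiny in-memory CSV texts (both are made up for illustration; `clean` and `messy` are hypothetical names). One stray text value is enough to keep the whole column from being read as numbers.

```python
import io
import pandas as pd

# Same layout, but the second text has one non-numeric rank ("two")
clean = pd.read_csv(io.StringIO("team,rank\nA,1\nB,2\nC,3\n"))
messy = pd.read_csv(io.StringIO("team,rank\nA,1\nB,two\nC,3\n"))

# The clean column is inferred as integers
print(pd.api.types.is_integer_dtype(clean["rank"]))  # True

# The messy column is not numeric; every value is kept as text
print(pd.api.types.is_numeric_dtype(messy["rank"]))  # False
print(messy["rank"].tolist())                        # ['1', 'two', '3']
```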

### Describe Method: Summary of Numerical Data (count, mean, std, min, quartiles, max)

You can access simple statistical information for numerical data columns using the ```.describe()``` method.

In [76]:
bowlData.describe()

Unnamed: 0,id,year,winner_points,loser_points,attendance
count,1527.0,1527.0,1527.0,1527.0,1518.0
mean,764.0,1991.286182,30.253438,17.092338,49487.57444
std,440.951244,24.4379,12.111077,10.395141,23552.602532
min,1.0,1901.0,0.0,0.0,0.0
25%,382.5,1976.0,21.0,10.0,31383.0
50%,764.0,1998.0,30.0,16.0,49056.0
75%,1145.5,2011.0,38.0,24.0,68321.5
max,1527.0,2021.0,70.0,61.0,106869.0


Notice that ```.describe()``` returns statistical information only for numerical data columns. 

For which year was the earliest data record in our dataframe collected? We can see the answer is the "min" of the year column, 1901.

To see all columns, numeric or not, you can add a parameter to ```.describe()``` like this:

In [77]:
bowlData.describe(include = 'all')

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
count,1527.0,1527.0,1527,1527,1527,754.0,1527.0,1527,671.0,1527.0,1518.0,1358,820,1527
unique,,,691,7,166,26.0,,171,26.0,,,1331,166,77
top,,,1/1/1948,Sat,Alabama,3.0,,Alabama,2.0,,,QB Byron Leftwich (Marshall),Outback Steakhouse,Rose Bowl
freq,,,11,412,45,49.0,,30,38.0,,,3,26,108
mean,764.0,1991.286182,,,,,30.253438,,,17.092338,49487.57444,,,
std,440.951244,24.4379,,,,,12.111077,,,10.395141,23552.602532,,,
min,1.0,1901.0,,,,,0.0,,,0.0,0.0,,,
25%,382.5,1976.0,,,,,21.0,,,10.0,31383.0,,,
50%,764.0,1998.0,,,,,30.0,,,16.0,49056.0,,,
75%,1145.5,2011.0,,,,,38.0,,,24.0,68321.5,,,


You can now see 3 additional statistics that operate only on non-numeric data columns: unique, top, and freq. NaN appears wherever the data type is not appropriate for the statistic. 

Who is the most common sponsor and how many times were they a sponsor? 

Looking to the sponsor column, the "top" row indicates the most common data value is Outback Steakhouse and the "freq" row indicates that Outback Steakhouse was a sponsor 26 times.


### Info Method: Understanding all Fields, Null Values, Dtypes, Shape, Size, etc.

In [78]:
bowlData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1527 entries, 0 to 1526
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1527 non-null   int64  
 1   year           1527 non-null   int64  
 2   date           1527 non-null   object 
 3   day            1527 non-null   object 
 4   winner_tie     1527 non-null   object 
 5   winner_rank    754 non-null    object 
 6   winner_points  1527 non-null   int64  
 7   loser_tie      1527 non-null   object 
 8   loser_rank     671 non-null    object 
 9   loser_points   1527 non-null   int64  
 10  attendance     1518 non-null   float64
 11  mvp            1358 non-null   object 
 12  sponsor        820 non-null    object 
 13  bowl_name      1527 non-null   object 
dtypes: float64(1), int64(4), object(9)
memory usage: 167.1+ KB


# III. Selecting/Slicing Data with .loc[]

The Pandas indexer ```.loc[]``` allows us to directly access rows by index and columns by name with ```.loc[row_index, column_name]```, or to access all rows of data that satisfy a condition with ```.loc[condition]```. Let's take a look, starting with selecting by row index and column name.

## Single Cell

In [79]:
# SELECT a single cell - the attendance column where row index is 1
bowlData.loc[1, 'attendance']

10822.0

If you have an integer in the "row" part of ```.loc[]```, Pandas assumes this is the *index* of the row.
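
To make the point about index labels concrete, here is a minimal sketch with a made-up dataframe whose index is 10, 20, 30 rather than 0, 1, 2 (the `scores` name and values are hypothetical). The integer passed to ```.loc[]``` is matched against the index labels, not the row position.

```python
import pandas as pd

# Index labels are 10, 20, 30 (not the default 0, 1, 2)
scores = pd.DataFrame({"points": [47, 55, 38]}, index=[10, 20, 30])

# 20 is an index label, so this returns the second row's value
print(scores.loc[20, "points"])  # 55

# There is no row labeled 1, so .loc raises a KeyError
try:
    scores.loc[1, "points"]
    found = True
except KeyError:
    found = False
print(found)  # False
```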

## Single row

In [80]:
# SELECT whole row of data where row index is 1
bowlData.loc[1, :]

id                                                               2
year                                                          2020
date                                                    12/29/2020
day                                                            Tue
winner_tie                                                   Texas
winner_rank                                                     20
winner_points                                                   55
loser_tie                                                 Colorado
loser_rank                                                     NaN
loser_points                                                    23
attendance                                                 10822.0
mvp              Texas RB Bijan Robinson, Texas LB DeMarvion Ov...
sponsor                                                     Valero
bowl_name                                               Alamo Bowl
Name: 1, dtype: object

A colon by itself means "everything", here specifically it means all columns.

## Single Column

In [81]:
# SELECT a single column of data, just the attendance column
bowlData.loc[:, 'attendance']

0       59121.0
1       10822.0
2       60147.0
3       60675.0
4       57653.0
         ...   
1522    21456.0
1523    20126.0
1524    10132.0
1525    30456.0
1526    26542.0
Name: attendance, Length: 1527, dtype: float64

Here the colon by itself means all rows.

## Slice of Rows

In [82]:
# SELECT a slice of rows where row index is 1 to 6
bowlData.loc[1:6, :]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
1,2,2020,12/29/2020,Tue,Texas,20.0,55,Colorado,,23,10822.0,"Texas RB Bijan Robinson, Texas LB DeMarvion Ov...",Valero,Alamo Bowl
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12.0,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
4,5,2017,12/28/2017,Thu,Texas Christian,13.0,39,Stanford,15.0,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
6,7,2015,1/2/2016,Sat,Texas Christian,11.0,47,Oregon,15.0,41,64569.0,"TCU QB Brian Kohlhausen, TCU S Travin Howard",Valero,Alamo Bowl


A colon between two integers means a slice (here, meaning get multiple rows).

Notice that a ```.loc[]``` slice of a Pandas dataframe is inclusive of the ending row index 6, such that a slice of rows 1:6 returns 6 rows. Just a quick note that this is not always the case elsewhere in Python. Positional array slicing with the Numpy or Xarray packages, for example, is exclusive of the ending index, but we'll cover that in a different notebook.
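
The difference is easy to see side by side. This small sketch uses made-up values (`values` and `frame` are hypothetical names) and compares a plain Python list slice with a ```.loc[]``` slice over the same numbers.

```python
import pandas as pd

values = [10, 11, 12, 13, 14, 15, 16, 17]
frame = pd.DataFrame({"value": values})

# Plain Python list slice: the end position 6 is excluded
list_slice = values[1:6]

# .loc slice: the end label 6 is included
loc_slice = frame.loc[1:6, "value"]

print(len(list_slice), len(loc_slice))  # 5 6
print(list_slice[-1], loc_slice.iloc[-1])  # 15 16
```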

## All Rows, Slice of Columns

In [83]:
# SELECT all rows but only a slice of columns from year to winner_rank 
bowlData.loc[:, 'year':'winner_rank']

Unnamed: 0,year,date,day,winner_tie,winner_rank
0,2021,12/29/2021,Wed,Oklahoma,14
1,2020,12/29/2020,Tue,Texas,20
2,2019,12/31/2019,Tue,Texas,
3,2018,12/28/2018,Fri,Washington State,12
4,2017,12/28/2017,Thu,Texas Christian,13
...,...,...,...,...,...
1522,2004,12/30/2004,Thu,Northern Illinois,
1523,2003,12/30/2003,Tue,Fresno State,
1524,2002,12/31/2002,Tue,Fresno State,
1525,2001,12/31/2001,Mon,Michigan State,


Notice that when a dataframe has more rows than Pandas will display at once, the output is truncated to show a bit of the beginning and a bit of the end. The same thing happens if your output has too many columns.

## Slice of Rows, Slice of Columns

In [84]:
# SELECT a slice of rows (index 10 to 15) and a slice of columns ('year' to 'winner_rank')
bowlData.loc[10:15, 'year':'winner_rank']

Unnamed: 0,year,date,day,winner_tie,winner_rank
10,2011,12/29/2011,Thu,Baylor,15.0
11,2010,12/29/2010,Wed,Oklahoma State,16.0
12,2009,1/2/2010,Sat,Texas Tech,
13,2008,12/29/2008,Mon,Missouri,25.0
14,2007,12/29/2007,Sat,Penn State,
15,2006,12/30/2006,Sat,Texas,18.0


## Slice of Rows, Particular Columns

In [85]:
# SELECT a slice of rows and two specific columns
bowlData.loc[10:15, ['year', 'winner_rank']]

Unnamed: 0,year,winner_rank
10,2011,15.0
11,2010,16.0
12,2009,
13,2008,25.0
14,2007,
15,2006,18.0


## All Rows Where Column has Certain Value

Now, instead of putting row indexes and column names in ```.loc[]``` we'll use a condition instead.

In [86]:
# find all rows where the year column is equal to 1901
bowlData.loc[bowlData['year'] == 1901]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
1020,1021,1901,1/1/1902,Wed,Michigan,,49,Stanford,,0,8000.0,RB Neil Snow (Michigan),,Rose Bowl


This returned only one row. If there were more rows where the year column was equal to 1901, then there would be more rows returned.
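
It can help to look at the condition on its own. In this minimal sketch with a made-up dataframe (`games` and its values are hypothetical), the comparison produces a Series of True/False values, one per row, and ```.loc[]``` keeps only the rows marked True.

```python
import pandas as pd

games = pd.DataFrame(
    {"year": [1901, 1948, 1901, 2021], "winner_points": [49, 40, 12, 47]}
)

# The condition by itself is a Series of booleans, one per row
mask = games["year"] == 1901
print(mask.tolist())  # [True, False, True, False]

# .loc keeps only the rows where the mask is True
selected = games.loc[mask]
print(selected.index.tolist())  # [0, 2]
```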

In [87]:
# find all rows where the sponsor column is missing data
bowlData.loc[bowlData['sponsor'].isna()]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
15,16,2006,12/30/2006,Sat,Texas,18,26,Iowa,,24,65875.0,"QB Colt McCoy (Texas), DB Aaron Ross (Texas)",,Alamo Bowl
45,46,2005,12/23/2005,Fri,Kansas,,42,Houston,,13,33505.0,QB Jason Swanson (Kansas),,Armed Forces Bowl
48,49,2021,12/17/2021,Fri,Middle Tennessee State,,31,Toledo,,24,13596.0,"MTSU QB Nicholas Vattiato, MTSU LB DQ Thomas",,Bahamas Bowl
51,52,2017,12/22/2017,Fri,Ohio,,41,Alabama-Birmingham,,6,13585.0,"Ohio RB Dorian Brown, Ohio FS Javon Hagan",,Bahamas Bowl
59,60,2016,12/26/2016,Mon,Mississippi State,,17,Miami,,16,15717.0,MSU QB Nick Fitzgerald,,Union Home Mortgage Gasparilla Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl


## Rows Based on Column Comparison

If we want to return all the games where the winner was ranked below the loser (that is, the winner's rank number is higher than the loser's), we could do it this way.

In [88]:
# find all rows where the winner ranked below the loser
bowlData.loc[bowlData['winner_rank'] > bowlData['loser_rank']]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
5,6,2016,12/29/2016,Thu,Oklahoma State,13,38,Colorado,11,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
7,8,2014,1/2/2015,Fri,UCLA,14,40,Kansas State,11,35,60517.0,"RB Paul Perkins (UCLA), LB Eric Kendricks (UCLA)",Valero,Alamo Bowl
13,14,2008,12/29/2008,Mon,Missouri,25,30,Northwestern,22,23,55986.0,"WR Jeremy Maclin (Missouri), LB Sean Witherspo...",Valero Energy Corporation,Alamo Bowl
21,22,2000,12/30/2000,Sat,Nebraska,9,66,Northwestern,18,17,60028.0,"RB Dan Alexander (Nebraska), DL Kyle Vanden Bo...",Sylvania,Alamo Bowl
26,27,1995,12/28/1995,Thu,Texas A&M,19,22,Michigan,14,20,64597.0,"K Kyle Bryant (Texas A&M), LB Keith Mitchell (...",Builders Square,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1422,1423,2001,12/28/2001,Fri,Texas,9,47,Washington,21,43,60548.0,"QB Major Applewhite (Texas), LB Derrick Johnso...",Culligan,Holiday Bowl
1423,1424,2000,12/29/2000,Fri,Oregon,8,35,Texas,12,30,63278.0,"QB Joey Harrington (Oregon), DB Rashad Bauman ...",Culligan,Holiday Bowl
1425,1426,1998,12/30/1998,Wed,Arizona,5,23,Nebraska,14,20,65354.0,"QB Keith Smith (Arizona), DL Mike Rucker (Nebr...",Culligan,Holiday Bowl
1427,1428,1996,12/30/1996,Mon,Colorado,8,33,Washington,13,21,54749.0,"QB Koy Detmer (Colorado), DL Nick Ziegler (Col...",Plymouth,Holiday Bowl


**Be careful in your analysis! Do you notice anything suspicious about the results that were returned?**

We're looking for cases where the underdog won, for example, a team that was ranked 20th beating a team that was ranked 10th (see index 1429). If you look closely at the results above, you can see multiple examples of rows returned that are NOT what we asked for. For example, index 1427 shows the winner had a better rank (8th) than the loser (13th). So, we've got incorrect results but did not receive an error message.

**Why did this happen??** 

This comes back to the data type of the columns winner_rank and loser_rank. A ranking should be a numerical data type (such as int or float). If these columns had a numeric data type then we shouldn't have experienced any problems using the greater than operator. But if those columns contain non-numeric data (dtype object, which usually indicates string data) then that could yield unexpected results when comparing if one string is greater than another. Let's take a look at the data type for winner_rank and loser_rank.

In [89]:
bowlData[['winner_rank', 'loser_rank']].dtypes

winner_rank    object
loser_rank     object
dtype: object

Uh oh! We can see that Pandas determined the data type of those columns to be object, which is non-numeric. We discussed briefly already why this might happen. This should indicate to us that there must be some "messy" data somewhere in those columns, some data values that don't look like numbers. And due to this, Pandas is treating those data columns like strings instead of numbers. So, when we ask if the winner_rank is greater than the loser_rank, the process of comparing two string values with the greater than operator returns unexpected results.
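
Here is a small sketch of why string comparison goes wrong. Strings are compared character by character, so "8" is considered greater than "13" because the character "8" comes after "1". The two-row frames below are made up for illustration (`as_text` and `as_numbers` are hypothetical names), using rank pairs like those in the suspicious rows above.

```python
import pandas as pd

# Plain Python shows the same behavior
print("8" > "13")  # True  (compares the characters "8" and "1")
print(8 > 13)      # False (compares the numbers)

# Ranks stored as text versus ranks stored as numbers
as_text = pd.DataFrame({"winner_rank": ["9", "8"], "loser_rank": ["18", "13"]})
as_numbers = pd.DataFrame({"winner_rank": [9, 8], "loser_rank": [18, 13]})

text_result = (as_text["winner_rank"] > as_text["loser_rank"]).tolist()
number_result = (as_numbers["winner_rank"] > as_numbers["loser_rank"]).tolist()

print(text_result)    # [True, True]
print(number_result)  # [False, False]
```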

When beginning work with any dataset, it is best to do some data cleaning first to make sure your columns are of the expected data types and to remove any wacky data values in order to avoid problems like this one. Otherwise, you might easily miss the fact that your code isn't doing what you intended.

Let's see what the problem is, clean up those two columns, ensure their dtype is numeric, and try our selection again.

We'll start by using a pandas function that we haven't seen yet: ```.unique()``` on the winner_rank column of the dataframe.

In [90]:
# look at all the unique data values in the winner_rank column
bowlData['winner_rank'].unique()

array(['14', '20', nan, '12', '13', '11', '10', '15', '16', '25', '18',
       '24', '22', '9', '17', '21', '19', '23', '8', '6', '4', '2',
       'Pennsylvania', '3', '1', '5', '7'], dtype=object)

Look at that! Someone has entered a value of Pennsylvania in the column of winner_rank somewhere in the data file. That doesn't make sense at all. Notice, we also have some missing data (nan), but that shouldn't cause us problems here.

Let's force the winner_rank data column to be numeric using another pandas function that we haven't seen yet: ```pd.to_numeric()```

In [91]:
# reassign all the values in the column winner_rank with numeric data values
# any value that does not look like a number will be changed to the missing data value
bowlData['winner_rank'] = pd.to_numeric(bowlData['winner_rank'], errors = 'coerce')

# look again at the unique data values
bowlData['winner_rank'].unique()

array([14., 20., nan, 12., 13., 11., 10., 15., 16., 25., 18., 24., 22.,
        9., 17., 21., 19., 23.,  8.,  6.,  4.,  2.,  3.,  1.,  5.,  7.])

In [92]:
# look at the new data type
bowlData['winner_rank'].dtype

dtype('float64')

How did we know what parameters to enter in ```pd.to_numeric()```? That information can be found in the Pandas documentation on the [API reference page for pd.to_numeric()](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html). Or you could have done a quick Google search for something like "how to use pandas to_numeric", which will bring up a great AI generated answer (at least it does in the USA at the time this notebook was created), the link to the Pandas API reference page, and many other websites that demonstrate how to use the function.
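
As a quick illustration of what the ```errors``` parameter changes, the sketch below runs ```pd.to_numeric()``` on a short made-up series (`raw` is a hypothetical name). With the default setting, a value that cannot be parsed raises an error; with ```errors = 'coerce'``` it becomes the missing data value instead.

```python
import pandas as pd

raw = pd.Series(["14", "20", None, "Pennsylvania", "3"])

# Default behavior: an unparseable value raises a ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# errors='coerce': unparseable values become NaN instead
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.dtype)         # float64
print(cleaned.isna().sum())  # 2 (the original None plus "Pennsylvania")
```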

Let's check out what the problem is with loser_rank.

In [93]:
# look at all the unique data values
bowlData['loser_rank'].unique()

array(['15', nan, '12', '25', '11', '22', '20', '14', '18', '4', '24',
       '19', '17', '13', '16', '23', '21', '7', '9', '10', '6', 'TN', '1',
       '3', '8', '2', '5'], dtype=object)

Again, there's a 'TN' in the data values that doesn't make any sense. We'll follow the same process to clean the loser_rank column and convert it to a numeric data type.

In [94]:
# reassign all the values in the column with numeric data values
# any value that does not look like a number will be changed to the missing data value
bowlData['loser_rank'] = pd.to_numeric(bowlData['loser_rank'], errors = 'coerce')

# look again at the unique data values
bowlData['loser_rank'].unique()

array([15., nan, 12., 25., 11., 22., 20., 14., 18.,  4., 24., 19., 17.,
       13., 16., 23., 21.,  7.,  9., 10.,  6.,  1.,  3.,  8.,  2.,  5.])

In [95]:
# look at the new data type
bowlData['loser_rank'].dtype

dtype('float64')

Lastly, let's try the row selection by column comparison again.

In [96]:
# finds all rows where the winner ranked below the loser
bowlData.loc[bowlData['winner_rank'] > bowlData['loser_rank']]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
7,8,2014,1/2/2015,Fri,UCLA,14.0,40,Kansas State,11.0,35,60517.0,"RB Paul Perkins (UCLA), LB Eric Kendricks (UCLA)",Valero,Alamo Bowl
13,14,2008,12/29/2008,Mon,Missouri,25.0,30,Northwestern,22.0,23,55986.0,"WR Jeremy Maclin (Missouri), LB Sean Witherspo...",Valero Energy Corporation,Alamo Bowl
26,27,1995,12/28/1995,Thu,Texas A&M,19.0,22,Michigan,14.0,20,64597.0,"K Kyle Bryant (Texas A&M), LB Keith Mitchell (...",Builders Square,Alamo Bowl
132,133,2021,1/1/2022,Sat,Kentucky,25.0,20,Iowa,17.0,17,50769.0,Kentucky WR Wan'Dale Robinson,Vrbo,Citrus Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,1417,2007,12/27/2007,Thu,Texas,17.0,52,Arizona State,12.0,34,64020.0,"QB Colt McCoy (Texas), DL Brian Orakpo (Texas)",Pacific Life Insurance,Holiday Bowl
1419,1420,2004,12/30/2004,Thu,Texas Tech,23.0,45,California,4.0,31,66222.0,"QB Sonny Cumbie (Texas Tech), DB Vincent Meeks...",Pacific Life Insurance,Holiday Bowl
1420,1421,2003,12/30/2003,Tue,Washington State,15.0,28,Texas,5.0,20,61102.0,"WR Sammy Moore (Washington State), P Kyle Basl...",Pacific Life Insurance,Holiday Bowl
1429,1430,1994,12/30/1994,Fri,Michigan,20.0,24,Colorado State,10.0,14,59453.0,"QB Todd Collins (Michigan), QB Anthoney Hill (...",Thrifty Car Rental,Holiday Bowl


Notice how the result now returns 215 rows whereas before we got 255 rows. Our original selection returned 40 incorrect results!
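
Why would ranks stored as strings give different answers? One likely contributor is that strings are compared character by character rather than by numeric value. The sketch below uses made-up rank values (not rows from ```bowlData```) to show the same comparison giving different results before and after conversion.

```python
import pandas as pd

# made-up ranks stored as strings
winner = pd.Series(['9', '3'])
loser = pd.Series(['10', '25'])

# string comparison looks at the first character: '9' > '1' and '3' > '2'
as_strings = (winner > loser).tolist()

# numeric comparison uses the actual values: 9 > 10 and 3 > 25
as_numbers = (pd.to_numeric(winner) > pd.to_numeric(loser)).tolist()

print(as_strings)  # [True, True]
print(as_numbers)  # [False, False]
```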

<div class="alert alert-info"> 

## Exercise 1: Selecting with .loc[]

Use ```.loc[]``` to select row indexes 100:110 and the three columns year, winner_points, loser_points from the bowlData dataframe.

</div>

In [97]:
# add your code here


<div class="alert alert-info"> 

Now select the rows where attendance is greater than 100,000.
</div>

In [98]:
# add your code here


<div class="alert alert-info"> 
    
How many games in the dataframe had attendance greater than 100,000? Don't count the rows in your answer above by eye; determine your answer programmatically.
</div>

In [99]:
# add your code here


# IV. Selecting data with .query()

Similar to ```.loc[]```, the ```.query()``` method can be used to select rows in a dataframe based on a condition. Inside the ```.query()``` we can put a condition that looks a little bit like a database query. You may like this method if you have experience working with databases.

## Rows Where Column has Certain Numerical Value

In [100]:
# find all rows where year is equal to 1901
bowlData.query("year == 1901")

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
1020,1021,1901,1/1/1902,Wed,Michigan,,49,Stanford,,0,8000.0,RB Neil Snow (Michigan),,Rose Bowl


## Rows Where Column has String Value

In [101]:
# find all rows where the bowl_name is "Rose Bowl"
# notice the single quotes around the string "Rose Bowl"
bowlData.query("bowl_name == 'Rose Bowl'")

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
913,914,2021,1/1/2022,Sat,Ohio State,7.0,48,Utah,10.0,45,87842.0,Ohio State WR Jaxon Smith-Njigba,Capital One,Rose Bowl
914,915,2020,1/1/2021,Fri,Alabama,1.0,31,Notre Dame,4.0,14,18373.0,"Alabama WR DeVonta Smith, Alabama CB Patrick S...",Capital One,Rose Bowl
915,916,2019,1/1/2020,Wed,Oregon,7.0,28,Wisconsin,11.0,27,90462.0,"Oregon QB Justin Herbert, Oregon S Brady Breeze",Northwestern Mutual,Rose Bowl
916,917,2018,1/1/2019,Tue,Ohio State,5.0,28,Washington,9.0,23,91853.0,"Ohio State QB Dwayne Haskins, Ohio State S Bre...",Northwestern Mutual,Rose Bowl
917,918,2017,1/1/2018,Mon,Georgia,3.0,54,Oklahoma,2.0,48,92844.0,"Georgia RB Sony Michel, Georgia LB Roquan Smith",Northwestern Mutual,Rose Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1016,1017,1918,1/1/1919,Wed,Great Lakes Navy,,17,Mare Island Marines,,0,,WR George Halas (Great Lakes Navy),,Rose Bowl
1017,1018,1917,1/1/1918,Tue,Mare Island Marines,,19,Fort Lewis,,7,,RB Hollis Huntington (Mare Island Marines),,Rose Bowl
1018,1019,1916,1/1/1917,Mon,Oregon,,14,Pennsylvania,,0,27000.0,OL John Beckett (Oregon),,Rose Bowl
1019,1020,1915,1/1/1916,Sat,Washington State,,14,Brown,,0,10000.0,RB Carl Dietz (Washington State),,Rose Bowl


## Rows Based on Substring Comparison

In [102]:
# Grab all rows where the word "State" appears in the winner_tie column
bowlData.query("winner_tie.str.contains('State')")

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
11,12,2010,12/29/2010,Wed,Oklahoma State,16.0,36,Arizona,,10,57593.0,"WR Justin Blackmon (Oklahoma State), DB Markel...",Valero Energy Corporation,Alamo Bowl
14,15,2007,12/29/2007,Sat,Penn State,,24,Texas A&M,,17,66166.0,"RB Rodney Kinlaw (Penn State), LB Sean Lee (Pe...",Valero Energy Corporation,Alamo Bowl
17,18,2004,12/29/2004,Wed,Ohio State,24.0,33,Oklahoma State,,7,65265.0,"WR Ted Ginn Jr. (Ohio State), DL Simon Fraser ...",MasterCard,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1499,1500,2013,12/26/2013,Thu,Utah State,,21,Northern Illinois,24.0,14,23408.0,"RB Joey DeMartino (Utah State), LB Jake Dought...",San Diego County Credit Union,Poinsettia Bowl
1502,1503,2010,12/23/2010,Thu,San Diego State,,35,Navy,,14,48049.0,"RB Ronnie Hillman (San Diego State), WR Vincen...",San Diego County Credit Union,Poinsettia Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl


## Rows Based on Column Comparison

In [103]:
# find all rows where the loser ranked higher (smaller number) than the winner (larger number)
bowlData.query("loser_rank < winner_rank")


Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
7,8,2014,1/2/2015,Fri,UCLA,14.0,40,Kansas State,11.0,35,60517.0,"RB Paul Perkins (UCLA), LB Eric Kendricks (UCLA)",Valero,Alamo Bowl
13,14,2008,12/29/2008,Mon,Missouri,25.0,30,Northwestern,22.0,23,55986.0,"WR Jeremy Maclin (Missouri), LB Sean Witherspo...",Valero Energy Corporation,Alamo Bowl
26,27,1995,12/28/1995,Thu,Texas A&M,19.0,22,Michigan,14.0,20,64597.0,"K Kyle Bryant (Texas A&M), LB Keith Mitchell (...",Builders Square,Alamo Bowl
132,133,2021,1/1/2022,Sat,Kentucky,25.0,20,Iowa,17.0,17,50769.0,Kentucky WR Wan'Dale Robinson,Vrbo,Citrus Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,1417,2007,12/27/2007,Thu,Texas,17.0,52,Arizona State,12.0,34,64020.0,"QB Colt McCoy (Texas), DL Brian Orakpo (Texas)",Pacific Life Insurance,Holiday Bowl
1419,1420,2004,12/30/2004,Thu,Texas Tech,23.0,45,California,4.0,31,66222.0,"QB Sonny Cumbie (Texas Tech), DB Vincent Meeks...",Pacific Life Insurance,Holiday Bowl
1420,1421,2003,12/30/2003,Tue,Washington State,15.0,28,Texas,5.0,20,61102.0,"WR Sammy Moore (Washington State), P Kyle Basl...",Pacific Life Insurance,Holiday Bowl
1429,1430,1994,12/30/1994,Fri,Michigan,20.0,24,Colorado State,10.0,14,59453.0,"QB Todd Collins (Michigan), QB Anthoney Hill (...",Thrifty Car Rental,Holiday Bowl


## Rows Based on Comparison with a Variable

In [104]:
# First, let's get the mean winner_points x 2
# .mean() is calculating the mean of the entire winner_points column
twiceTheMean = bowlData.winner_points.mean() * 2

# then, we can use that value to query for winners with more than twice the mean
bowlData.query("winner_points > @twiceTheMean")

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
10,11,2011,12/29/2011,Thu,Baylor,15.0,67,Washington,,56,65256.0,"RB Terrance Ganaway (Baylor), LB Elliot Coffey...",Valero Energy Corporation,Alamo Bowl
21,22,2000,12/30/2000,Sat,Nebraska,9.0,66,Northwestern,18.0,17,60028.0,"RB Dan Alexander (Nebraska), DL Kyle Vanden Bo...",Sylvania,Alamo Bowl
32,33,2018,12/22/2018,Sat,Army,22.0,70,Houston,,14,44738.0,"Army QB Kelvin Hopkins, Houston TE Romello Bro...",Lockhee Martin,Armed Forces Bowl
112,113,1999,12/31/1999,Fri,Colorado,,62,Boston College,25.0,28,35762.0,"RB Cortlen Johnson (Colorado), LB Jashon Sykes...",Insight Enterprises,Guaranteed Rate Bowl
316,317,2016,12/22/2016,Thu,Idaho,,61,Colorado State,,50,24975.0,Idaho QB Matt Linehan,Idaho Potato Commission,Famous Idaho Potato Bowl
362,363,1995,1/2/1996,Tue,Nebraska,1.0,62,Florida,2.0,24,79864.0,"QB Tommie Frazier (Nebraska), DB Michael Booke...",Tostitos,Fiesta Bowl
476,477,2014,1/4/2015,Sun,Toledo,,63,Arkansas State,,44,36811.0,RB Kareem Hunt (Toledo),GoDaddy,LendingTree Bowl
483,484,2007,1/6/2008,Sun,Tulsa,,63,Bowling Green,,7,36932.0,QB Paul Smith (Tulsa),GMAC,LendingTree Bowl
489,490,2001,12/19/2001,Wed,Marshall,,64,East Carolina,,61,40139.0,QB Byron Leftwich (Marshall),GMAC,LendingTree Bowl
657,658,2018,12/28/2018,Fri,Auburn,,63,Purdue,,14,59024.0,Auburn QB Jarrett Stidham,Franklin American Mortgage,Music City Bowl


## Multiple Criteria Query

In [105]:
# Find all the rows where the winner is Alabama, 
# AND Alabama had more than 50 points, 
# AND they weren't ranked number 1.
bowlData.query("(winner_tie == 'Alabama') and (winner_points > 50) and (winner_rank > 1)")

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
786,787,1952,1/1/1953,Thu,Alabama,8.0,61,Syracuse,16.0,6,66280.0,,,Orange Bowl


<div class="alert alert-info"> 
    
## Exercise 2: Selecting with .query()

Use ```.query()``` to find rows where the winner_tie column contains "State", the bowl_name contains "Rose", and the attendance is greater than 75,000.
</div>

In [106]:
# add your code here


<div class="alert alert-info"> 
    
Show programmatically how many rows your query found.
</div>

In [107]:
# add your code here

# V. Querying a Dataframe without .query()

Different syntax can be used to accomplish the same data queries we covered above without using ```.query()```. This syntax may be preferable if you don't have experience working with databases.

In [108]:
# find all rows where year is equal to 1901 without using .query()

# copy of code from the .query() section
# bowlData.query("year == 1901") 

# alternative query syntax
bowlData[bowlData.year == 1901]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
1020,1021,1901,1/1/1902,Wed,Michigan,,49,Stanford,,0,8000.0,RB Neil Snow (Michigan),,Rose Bowl


In [109]:
# find all rows where the bowl_name is "Rose Bowl"

# copy of code from the .query section
# bowlData.query("bowl_name == 'Rose Bowl'") 

# alternative query syntax
bowlData[bowlData.bowl_name == "Rose Bowl"]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
913,914,2021,1/1/2022,Sat,Ohio State,7.0,48,Utah,10.0,45,87842.0,Ohio State WR Jaxon Smith-Njigba,Capital One,Rose Bowl
914,915,2020,1/1/2021,Fri,Alabama,1.0,31,Notre Dame,4.0,14,18373.0,"Alabama WR DeVonta Smith, Alabama CB Patrick S...",Capital One,Rose Bowl
915,916,2019,1/1/2020,Wed,Oregon,7.0,28,Wisconsin,11.0,27,90462.0,"Oregon QB Justin Herbert, Oregon S Brady Breeze",Northwestern Mutual,Rose Bowl
916,917,2018,1/1/2019,Tue,Ohio State,5.0,28,Washington,9.0,23,91853.0,"Ohio State QB Dwayne Haskins, Ohio State S Bre...",Northwestern Mutual,Rose Bowl
917,918,2017,1/1/2018,Mon,Georgia,3.0,54,Oklahoma,2.0,48,92844.0,"Georgia RB Sony Michel, Georgia LB Roquan Smith",Northwestern Mutual,Rose Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1016,1017,1918,1/1/1919,Wed,Great Lakes Navy,,17,Mare Island Marines,,0,,WR George Halas (Great Lakes Navy),,Rose Bowl
1017,1018,1917,1/1/1918,Tue,Mare Island Marines,,19,Fort Lewis,,7,,RB Hollis Huntington (Mare Island Marines),,Rose Bowl
1018,1019,1916,1/1/1917,Mon,Oregon,,14,Pennsylvania,,0,27000.0,OL John Beckett (Oregon),,Rose Bowl
1019,1020,1915,1/1/1916,Sat,Washington State,,14,Brown,,0,10000.0,RB Carl Dietz (Washington State),,Rose Bowl


In [110]:
# Grab all rows where the word "State" appears in the winner_tie column

# copy of code from the .query section
# bowlData.query("winner_tie.str.contains('State')")

# alternative query syntax
bowlData[bowlData.winner_tie.str.contains('State')]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
11,12,2010,12/29/2010,Wed,Oklahoma State,16.0,36,Arizona,,10,57593.0,"WR Justin Blackmon (Oklahoma State), DB Markel...",Valero Energy Corporation,Alamo Bowl
14,15,2007,12/29/2007,Sat,Penn State,,24,Texas A&M,,17,66166.0,"RB Rodney Kinlaw (Penn State), LB Sean Lee (Pe...",Valero Energy Corporation,Alamo Bowl
17,18,2004,12/29/2004,Wed,Ohio State,24.0,33,Oklahoma State,,7,65265.0,"WR Ted Ginn Jr. (Ohio State), DL Simon Fraser ...",MasterCard,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1499,1500,2013,12/26/2013,Thu,Utah State,,21,Northern Illinois,24.0,14,23408.0,"RB Joey DeMartino (Utah State), LB Jake Dought...",San Diego County Credit Union,Poinsettia Bowl
1502,1503,2010,12/23/2010,Thu,San Diego State,,35,Navy,,14,48049.0,"RB Ronnie Hillman (San Diego State), WR Vincen...",San Diego County Credit Union,Poinsettia Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl


In [111]:
# find all rows where the loser ranked higher (smaller number) than the winner (larger number)

# copy of code from the .query section
# bowlData.query("loser_rank < winner_rank")

# alternative query syntax
bowlData[bowlData.loser_rank < bowlData.winner_rank]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
7,8,2014,1/2/2015,Fri,UCLA,14.0,40,Kansas State,11.0,35,60517.0,"RB Paul Perkins (UCLA), LB Eric Kendricks (UCLA)",Valero,Alamo Bowl
13,14,2008,12/29/2008,Mon,Missouri,25.0,30,Northwestern,22.0,23,55986.0,"WR Jeremy Maclin (Missouri), LB Sean Witherspo...",Valero Energy Corporation,Alamo Bowl
26,27,1995,12/28/1995,Thu,Texas A&M,19.0,22,Michigan,14.0,20,64597.0,"K Kyle Bryant (Texas A&M), LB Keith Mitchell (...",Builders Square,Alamo Bowl
132,133,2021,1/1/2022,Sat,Kentucky,25.0,20,Iowa,17.0,17,50769.0,Kentucky WR Wan'Dale Robinson,Vrbo,Citrus Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,1417,2007,12/27/2007,Thu,Texas,17.0,52,Arizona State,12.0,34,64020.0,"QB Colt McCoy (Texas), DL Brian Orakpo (Texas)",Pacific Life Insurance,Holiday Bowl
1419,1420,2004,12/30/2004,Thu,Texas Tech,23.0,45,California,4.0,31,66222.0,"QB Sonny Cumbie (Texas Tech), DB Vincent Meeks...",Pacific Life Insurance,Holiday Bowl
1420,1421,2003,12/30/2003,Tue,Washington State,15.0,28,Texas,5.0,20,61102.0,"WR Sammy Moore (Washington State), P Kyle Basl...",Pacific Life Insurance,Holiday Bowl
1429,1430,1994,12/30/1994,Fri,Michigan,20.0,24,Colorado State,10.0,14,59453.0,"QB Todd Collins (Michigan), QB Anthoney Hill (...",Thrifty Car Rental,Holiday Bowl


In [112]:
# Rows Based on Comparison with a Variable

# copy of code from the .query section
# First, let's get the mean winner_points x 2
# twiceTheMean = bowlData.winner_points.mean() * 2
# then, we can use that value to query for winners with more than twice the mean
# bowlData.query("winner_points > @twiceTheMean")

# alternative query syntax
twiceTheMean = bowlData.winner_points.mean() * 2  # the same
bowlData[bowlData.winner_points > twiceTheMean]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
10,11,2011,12/29/2011,Thu,Baylor,15.0,67,Washington,,56,65256.0,"RB Terrance Ganaway (Baylor), LB Elliot Coffey...",Valero Energy Corporation,Alamo Bowl
21,22,2000,12/30/2000,Sat,Nebraska,9.0,66,Northwestern,18.0,17,60028.0,"RB Dan Alexander (Nebraska), DL Kyle Vanden Bo...",Sylvania,Alamo Bowl
32,33,2018,12/22/2018,Sat,Army,22.0,70,Houston,,14,44738.0,"Army QB Kelvin Hopkins, Houston TE Romello Bro...",Lockhee Martin,Armed Forces Bowl
112,113,1999,12/31/1999,Fri,Colorado,,62,Boston College,25.0,28,35762.0,"RB Cortlen Johnson (Colorado), LB Jashon Sykes...",Insight Enterprises,Guaranteed Rate Bowl
316,317,2016,12/22/2016,Thu,Idaho,,61,Colorado State,,50,24975.0,Idaho QB Matt Linehan,Idaho Potato Commission,Famous Idaho Potato Bowl
362,363,1995,1/2/1996,Tue,Nebraska,1.0,62,Florida,2.0,24,79864.0,"QB Tommie Frazier (Nebraska), DB Michael Booke...",Tostitos,Fiesta Bowl
476,477,2014,1/4/2015,Sun,Toledo,,63,Arkansas State,,44,36811.0,RB Kareem Hunt (Toledo),GoDaddy,LendingTree Bowl
483,484,2007,1/6/2008,Sun,Tulsa,,63,Bowling Green,,7,36932.0,QB Paul Smith (Tulsa),GMAC,LendingTree Bowl
489,490,2001,12/19/2001,Wed,Marshall,,64,East Carolina,,61,40139.0,QB Byron Leftwich (Marshall),GMAC,LendingTree Bowl
657,658,2018,12/28/2018,Fri,Auburn,,63,Purdue,,14,59024.0,Auburn QB Jarrett Stidham,Franklin American Mortgage,Music City Bowl


In [113]:
# Find all the rows where the winner is Alabama, 
# AND Alabama had more than 50 points, 
# AND they weren't ranked number 1.

# copy of code from the .query section
# bowlData.query("(winner_tie == 'Alabama') and (winner_points > 50) and (winner_rank > 1)")

# alternative query syntax
bowlData[(bowlData.winner_tie == 'Alabama') & (bowlData.winner_points > 50) & (bowlData.winner_rank > 1)]

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
786,787,1952,1/1/1953,Thu,Alabama,8.0,61,Syracuse,16.0,6,66280.0,,,Orange Bowl


Notice that here we link the conditions with ```&``` rather than the boolean operator ```and``` that we learned in the previous notebook, Python Language Basics. The ```&``` is called a "bitwise and" whereas Python's ```and``` is called a "logical and". Unless you are using ```.query()```, Pandas requires the bitwise operators ```&``` (and), ```|``` (or), and ```~``` (not) to combine conditions on dataframe columns; using ```and```, ```or```, or ```not``` instead will raise an error. Also notice that each condition is wrapped in its own parentheses. They are required because ```&``` binds more tightly than comparison operators such as ```==``` and ```>```.
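
As a small illustration of the bitwise operators, the sketch below uses a made-up dataframe (```games``` and its columns are hypothetical, not part of the bowl data) and also shows the error you get from Python's ```and```.

```python
import pandas as pd

# made-up data for illustration only
games = pd.DataFrame({'team': ['A', 'B', 'A'], 'points': [55, 60, 20]})

# & (and): both conditions must be True for a row to be kept
both = games[(games['team'] == 'A') & (games['points'] > 50)]

# | (or): at least one condition must be True
either = games[(games['team'] == 'A') | (games['points'] > 50)]

# ~ (not): flips True and False
not_a = games[~(games['team'] == 'A')]

# Python's `and` needs a single True/False, but each condition is a whole column
# of True/False values, so pandas raises a ValueError
try:
    (games['team'] == 'A') and (games['points'] > 50)
    raised = False
except ValueError:
    raised = True

print(both.index.tolist(), either.index.tolist(), not_a.index.tolist(), raised)
```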

<div class="alert alert-info"> 
    
## Exercise 3: Query without using .query()

Without using ```.query()``` repeat the query from exercise 2 (find rows where the winner_tie column contains "State", the bowl_name contains "Rose", and the attendance is greater than 75,000).

</div>

In [114]:
# add your code here


# VI. DataFrame Manipulation

## Summary Functions

Pandas offers a handful of summary functions that can be applied to a column or columns of a dataframe (series objects). These functions are:

- ```.sum()``` Sum values of each object. 
- ```.count()``` Count non-NA values of each object. 
- ```.median()``` Median value of each object. 
- ```.quantile([0.25,0.75])``` Quantiles of each object. 
- ```.apply(function)``` Apply function to each object. 
- ```.min()``` Minimum value in each object. 
- ```.max()``` Maximum value in each object. 
- ```.mean()``` Mean value of each object. 
- ```.var()``` Variance of each object. 
- ```.std()``` Standard deviation of each object.

We won't cover all of these, but let's try a few.

In [115]:
# the mean of winner_points
print("Winners had an average of ", bowlData['winner_points'].mean())

# the median of winner_points
print("Winners had a median of ", bowlData['winner_points'].median())

# highest score of a losing team
print(f"The highest score of a losing team is {bowlData['loser_points'].max()}")

Winners had an average of  30.25343811394892
Winners had a median of  30.0
The highest score of a losing team is 61


Wow, 61 points and still a loss... ooof.
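
A few of the other summary functions from the list behave in ways worth seeing once. The following is a minimal sketch on a made-up Series (```scores``` is hypothetical, not a column of ```bowlData```) showing that ```.count()``` skips missing values, that ```.quantile()``` accepts a list, and that ```.apply()``` works element by element rather than returning a single summary value.

```python
import numpy as np
import pandas as pd

# made-up scores with one missing value
scores = pd.Series([10.0, 20.0, np.nan, 40.0])

n_present = scores.count()                 # counts only the non-missing values
quartiles = scores.quantile([0.25, 0.75])  # returns one value per requested quantile
doubled = scores.apply(lambda x: x * 2)    # runs the function on every value

print(n_present)
print(quartiles)
print(doubled.tolist())
```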

<div class="alert alert-info"> 
    
### Exercise 4: Find the standard deviation of a column

Create a print statement similar to those above to print the standard deviation of the winner_points column and of the loser_points column.
</div>

In [116]:
# add your code here


## Sorting Data

Let's begin with a simple sort on a numeric data column with Pandas ```.sort_values()```. This function sorts in ascending order by default; provide the parameter ```ascending=False``` to sort in descending order. In the code below we pass ```ascending = True``` explicitly to make the sort direction obvious.

First, we'll double check that the year column of the dataframe is numeric (we'll get into why we're doing this in a minute). Then we'll sort by year ascending and return only the year and winner_points columns.

In [117]:
# see if year column is numeric
bowlData.year.dtype

dtype('int64')

Excellent, we should be good to proceed.

In [118]:
# sort specific columns by year
bowlData[['year', 'winner_points']].sort_values(by = ['year'], ascending = True)

Unnamed: 0,year,winner_points
1020,1901,49
1019,1915,14
1018,1916,14
1017,1917,19
1016,1918,17
...,...,...
309,2021,38
655,2021,48
678,2021,30
216,2021,27
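
To see the sort direction options side by side, here is a small sketch on a made-up dataframe (```games``` is hypothetical). It also shows that ```by``` and ```ascending``` can each take a list when you want to sort on more than one column.

```python
import pandas as pd

# made-up data for illustration only
games = pd.DataFrame({'year': [2020, 1995, 2020, 2010],
                      'points': [31, 22, 55, 36]})

# with no ascending parameter, the sort is ascending
default_order = games.sort_values(by='year')['year'].tolist()

# ascending=False reverses the direction
descending = games.sort_values(by='year', ascending=False)['year'].tolist()

# newest year first, then highest score first within the same year
by_two = games.sort_values(by=['year', 'points'], ascending=[False, False])

print(default_order)
print(descending)
print(by_two)
```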


Now, let's do a more complex sort. Sort by winner_points descending to find the highest score by an upset winner (winner ranked lower than loser).

We'll use the columns winner_rank and loser_rank to accomplish this sort (which we have already converted to numeric data) as well as the winner_points column.

Let's double check the data type of the winner_points column before we sort.

In [119]:
# see if winner_points column is numeric
bowlData.winner_points.dtype

dtype('int64')

Great, another numeric data column.

In [120]:
# sort by winner_points to find the highest score by an upset winner
bowlData[bowlData.winner_rank > bowlData.loser_rank].sort_values(by = ['winner_points'], ascending = False)

Unnamed: 0,id,year,date,day,winner_tie,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,mvp,sponsor,bowl_name
727,728,2011,1/4/2012,Wed,West Virginia,23.0,70,Clemson,14.0,33,67563.0,QB Geno Smith (West Virginia),Discover Financial,Orange Bowl
920,921,2014,1/1/2015,Thu,Oregon,3.0,59,Florida State,2.0,20,91322.0,"QB Marcus Mariota (Oregon), LB Tony Washington...",Northwestern Mutual,Rose Bowl
917,918,2017,1/1/2018,Mon,Georgia,3.0,54,Oklahoma,2.0,48,92844.0,"Georgia RB Sony Michel, Georgia LB Roquan Smith",Northwestern Mutual,Rose Bowl
1078,1079,1996,1/2/1997,Thu,Florida,3.0,52,Florida State,1.0,20,78344.0,QB Danny Wuerffel (Florida),Nokia,Sugar Bowl
344,345,2013,1/1/2014,Wed,Central Florida,15.0,52,Baylor,6.0,42,65172.0,"QB Blake Bortles (UCF), LB Terrance Plummer (UCF)",Tostitos,Fiesta Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,299,1939,1/1/1940,Mon,Clemson,15.0,6,Boston College,14.0,3,20000.0,B Banks McFadden (Clemson),,Cotton Bowl
1153,1154,2008,12/31/2008,Wed,Oregon State,24.0,3,Pittsburgh,18.0,0,49037.0,DL Victor Strong-Butler (Oregon State),Brut,Sun Bowl
456,457,1957,12/28/1957,Sat,Tennessee,18.0,3,Texas A&M,4.0,0,41160.0,"RB Bobby Gordon (Tennessee), RB John David Cro...",,Gator Bowl
1133,1134,1941,1/1/1942,Thu,Fordham,8.0,2,Missouri,7.0,0,72000.0,,,Sugar Bowl


Notice that our code returned a lot of rows but we'll find the answer to our question in the first row in the winner_points column: 70.

How would we do the same sort but return only the highest score by an upset winner as opposed to all the data rows that were returned in the code above?

Since we are sorting with ```ascending = False```, the highest winner_points is located in the first row of returned data. In this case, the index of the first row is 727. We don't know ahead of time what the index value of the first row of results will be, though. Don't worry! We can grab the first row of the sorted results by its integer position, 0, rather than by its row index, using ```.iloc[]```.

From the results above, we expect the output of the following code to be 70.

In [121]:
# sort by the winner_points to find the highest score by an upset winner
bowlData[bowlData.winner_rank > bowlData.loser_rank].sort_values(by = ['winner_points'], ascending = False).iloc[0].loc['winner_points']

70

Wow! What just happened? We used ```.iloc[0]``` to select only the first row of the results returned by sorting, and then ```.loc['winner_points']``` to return the value of the winner_points column from the row returned by ```.iloc[0]```.

Are you beginning to see the power of Pandas? We can string together many functions in a row to achieve what we're looking for.
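
If the chain feels dense, it can help to split it into named steps. The sketch below does that on a made-up dataframe (```scores``` and its index labels are hypothetical) so you can compare selecting by position with ```.iloc[]``` against selecting by label with ```.loc[]```.

```python
import pandas as pd

# made-up data with index labels that are not 0, 1, 2
scores = pd.DataFrame({'team': ['A', 'B', 'C'], 'points': [21, 70, 35]},
                      index=[101, 727, 350])

ranked = scores.sort_values(by='points', ascending=False)

first_row = ranked.iloc[0]                 # position 0, whatever its index label is
top_points = first_row.loc['points']       # value looked up by column label
same_by_label = ranked.loc[727, 'points']  # works only if you already know the label

print(first_row.name, top_points, same_by_label)
```

When only the single highest value is needed, ```scores['points'].max()``` gives the same number; the sort-then-select pattern is most useful when you also want the rest of the row.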

Does sorting work on strings? 

Yes, but you have to be careful! If your strings only contain letters, ```.sort_values()``` will sort from A to Z or Z to A as you'd expect. 

But, if your strings contain numbers or letters and numbers together, ```.sort_values()``` may not return the sort order that you're expecting. This is because Pandas sorts strings character by character using lexicographic order. This is also why we've been double checking that our columns with numbers are numeric and not object data type. 

Before we move on, let's work through a quick example of what happens if you sort strings that contain numbers.

In [122]:
# create a dataframe with a column of data that looks like numbers, but are actually strings
df = pd.DataFrame({'ranking': ['1', '3', '5', '2', '4', '10']})
print(df.ranking.dtype)
df

object


Unnamed: 0,ranking
0,1
1,3
2,5
3,2
4,4
5,10


The ranking column that looks like numeric data is actually strings. What happens when we sort ascending?

In [123]:
df.sort_values(by = ['ranking'], ascending = True)

Unnamed: 0,ranking
0,1
5,10
3,2
1,3
4,4
2,5


This is just something to be aware of. If you want to keep your numerical-looking data as strings but sort in numerical order, a few extra steps are required. We won't cover that in detail here, but if you're interested you can find plenty of information on how to accomplish it on the web. For example, see [this user question and answer on stackoverflow.com](https://stackoverflow.com/questions/37693600/how-to-sort-dataframe-based-on-particular-stringcolumns-using-python-pandas).
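
For readers who want a starting point anyway, one possible approach is the ```key``` parameter of ```.sort_values()```, sketched below. This assumes pandas 1.1 or newer, where ```key``` is available; the dataframe name ```df_ranks``` is made up so it doesn't overwrite ```df``` from the cells above.

```python
import pandas as pd

# the same string rankings as in the example above
df_ranks = pd.DataFrame({'ranking': ['1', '3', '5', '2', '4', '10']})

# key converts the values to numbers only for the purpose of ordering;
# the column itself still holds strings afterwards
numeric_order = df_ranks.sort_values(by='ranking',
                                     key=lambda col: pd.to_numeric(col))

print(numeric_order['ranking'].tolist())
```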

## Grouping Data

```.groupby()``` will partition data into groups for you, which you can then operate on using functions like ```.mean()```, ```.sum()```, etc. Let's start with a super simple example before we try to use ```.groupby()``` on our ```bowlData```.

In [124]:
# a new very simple dataframe 
# pretend we have been observing two falcons and two parrots
# and we are keeping data records on their maximum observed flight speed in miles per hour
birds = pd.DataFrame({'species' : ['falcon', 'falcon', 'parrot', 'parrot'],
                   'individual' : ['f01', 'f02', 'squawky', 'pretty boy'],
                   'age_class' : ['adult', 'adult', 'juvenile', 'adult'],
                   'max_speed_mph' : [230., 240., 35., 40.]})
birds

Unnamed: 0,species,individual,age_class,max_speed_mph
0,falcon,f01,adult,230.0
1,falcon,f02,adult,240.0
2,parrot,squawky,juvenile,35.0
3,parrot,pretty boy,adult,40.0


We can use ```.groupby``` to find the average max speed of each species if we group the dataframe rows by the species column and then for each group of rows (there will be two groups: falcon rows and parrot rows) take the mean of the max_speed_mph column.

In [125]:
birds.groupby("species")["max_speed_mph"].mean()

species
falcon    235.0
parrot     37.5
Name: max_speed_mph, dtype: float64

Nice! The average max speed of all the falcons is 235 mph and the average max speed of all the parrots is 37.5 mph.

Pandas ```.groupby()``` will work similarly on our much larger ```bowlData``` dataframe.

Let's group by winner_tie (which is the name of the winning team) and then find the average number of points the winning team scored (winner_points).

In [126]:
avgPointsByWinners = bowlData.groupby("winner_tie")["winner_points"].mean()
avgPointsByWinners

winner_tie
Air Force             29.400000
Akron                 23.000000
Alabama               29.111111
Alabama-Birmingham    34.000000
Appalachian State     38.000000
                        ...    
Western Reserve       26.000000
William & Mary        20.000000
Wisconsin             27.055556
Wyoming               29.888889
Xavier                33.000000
Name: winner_points, Length: 166, dtype: float64

With ```bowlData```, it is easier to see the default behavior of ```.groupby()``` with regard to sorting. By default, ```.groupby()``` sorts the result in ascending order of the grouping column (here, winner_tie), which is why our result runs from A to Z. To keep the groups in the order they first appear in the data instead, specify ```sort=False``` as a parameter in ```.groupby()```.
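
The effect of ```sort=False``` is easiest to see when the groups are not already in alphabetical order. Here is a minimal sketch on a made-up dataframe (```sightings``` is hypothetical and deliberately lists a parrot first).

```python
import pandas as pd

# made-up observations, with parrot appearing before falcon
sightings = pd.DataFrame({'species': ['parrot', 'falcon', 'parrot', 'falcon'],
                          'max_speed_mph': [35., 230., 40., 240.]})

# default: groups come back sorted A to Z
sorted_means = sightings.groupby('species')['max_speed_mph'].mean()

# sort=False: groups come back in order of first appearance
unsorted_means = sightings.groupby('species', sort=False)['max_speed_mph'].mean()

print(sorted_means)
print(unsorted_means)
```

The values are the same either way; only the order of the groups changes.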

<div class="alert alert-info"> 

### Exercise 5: Using .groupby() 

Find the maximum observed flight speed for each species in the birds dataframe.
</div>

In [127]:
# add your code here


<div class="alert alert-info"> 
This next one is challenging and uses a function we haven't covered yet. See if you can work it out!

For each species, find the name of the fastest individual. 

Hint: you will need to use ```.loc[]```, ```.groupby()```, and also a function called ```.idxmax()``` (in that order). See if you can figure it out with some help from the web. The doc page for [.idxmax()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html) and [this user question and answer(s) from stackoverflow.com](https://stackoverflow.com/questions/39964558/pandas-max-value-index) may get you part of the way to the answer.
</div>

In [128]:
# add your code here


## Renaming Columns

To change column names, the easiest way is this:

In [129]:
# create a dictionary where key is the old name and value is the new name
columnMap = {"mvp" : "most_valuable_player", "winner_tie" : "winner"}

bowlData = bowlData.rename(columns = columnMap, errors = "raise")

bowlData

Unnamed: 0,id,year,date,day,winner,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,most_valuable_player,sponsor,bowl_name
0,1,2021,12/29/2021,Wed,Oklahoma,14.0,47,Oregon,15.0,32,59121.0,"Oklahoma RB Kennedy Brooks, Oklahoma S Pat Fields",Valero,Alamo Bowl
1,2,2020,12/29/2020,Tue,Texas,20.0,55,Colorado,,23,10822.0,"Texas RB Bijan Robinson, Texas LB DeMarvion Ov...",Valero,Alamo Bowl
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12.0,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
4,5,2017,12/28/2017,Thu,Texas Christian,13.0,39,Stanford,15.0,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20.0,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl


## Handling Missing Values

Sometimes you will need to assess how much of your data is missing to make sure any statistics you apply to your data are robust. 

Sometimes having the missing data value (NaN) in your data is beneficial. Many functions will simply ignore missing data or propagate missing data values. For example, if you were to add one data value to another and the values were NaN + 5, the result would be NaN. This is often desirable.
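A short sketch of both behaviors, using a made-up series of three values (one of them missing):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# arithmetic propagates missing values: NaN + 5 is still NaN
shifted = values + 5

# summary functions skip missing values by default
total = values.sum()
average = values.mean()

print(shifted)
print(total, average)
```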

But in some cases, like for various machine learning techniques, you may need to ensure that there are no missing data values present in your data at all. In this case you may need to fill the missing data values with a different number or drop rows with missing data from your dataframe entirely.

Let's look at the Pandas functions that can help us with missing data.

In [130]:
# show how many missing values are present in each column
bowlData.isna().sum()

id                        0
year                      0
date                      0
day                       0
winner                    0
winner_rank             774
winner_points             0
loser_tie                 0
loser_rank              857
loser_points              0
attendance                9
most_valuable_player    169
sponsor                 707
bowl_name                 0
dtype: int64

In [131]:
# look at all the rows where winner_rank is missing
bowlData[bowlData['winner_rank'].isna()]

Unnamed: 0,id,year,date,day,winner,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,most_valuable_player,sponsor,bowl_name
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12.0,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl
9,10,2012,12/29/2012,Sat,Texas,,31,Oregon State,15.0,27,65277.0,"WR Marquise Goodwin (Texas), DL Alex Okafor (T...",Valero Energy Corporation,Alamo Bowl
12,13,2009,1/2/2010,Sat,Texas Tech,,41,Michigan State,,31,64757.0,"QB Taylor Potts (Texas Tech), DB Jamar Wall (T...",Valero Energy Corporation,Alamo Bowl
14,15,2007,12/29/2007,Sat,Penn State,,24,Texas A&M,,17,66166.0,"RB Rodney Kinlaw (Penn State), LB Sean Lee (Pe...",Valero Energy Corporation,Alamo Bowl
16,17,2005,12/28/2005,Wed,Nebraska,,32,Michigan,20.0,28,62000.0,"RB Cory Ross (Nebraska), DB Leon Hall (Michigan)",MasterCard,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20.0,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl


In [132]:
# replace all missing values with 0 
# notice we are making a copy of our data by saving the dataframe to a new variable here
betterBowlData = bowlData.fillna(0)

# let's see if any missing values are still present - shouldn't be
betterBowlData.isna().sum()

id                      0
year                    0
date                    0
day                     0
winner                  0
winner_rank             0
winner_points           0
loser_tie               0
loser_rank              0
loser_points            0
attendance              0
most_valuable_player    0
sponsor                 0
bowl_name               0
dtype: int64

Let's pause for a minute to think about what we just did. We filled all NaN with 0. Does that make sense? Does it make sense to have winner_rank = 0 or sponsor = 0? You can see how filling NaN with another value may end up being confusing to you later. So think carefully about whether you really need to replace missing data values, and whether filling with a value like 0 will work for your analysis.
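One alternative is to fill only the columns where a replacement value makes sense. ```.fillna()``` accepts a dictionary that maps column names to fill values. The sketch below uses a hypothetical two-column dataframe that mimics ```bowlData```; the fill label "no sponsor" is just an example choice.

```python
import numpy as np
import pandas as pd

# hypothetical toy rows that mimic two bowlData columns
toy = pd.DataFrame({
    "sponsor": ["Valero", None, "SeaWorld"],
    "attendance": [59121.0, np.nan, 61113.0],
})

# fill only the sponsor column with a label that suits text data;
# attendance keeps its NaN so summary functions continue to skip it
filled = toy.fillna({"sponsor": "no sponsor"})
print(filled)
```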

Instead of filling NaN with a different value, you may need to drop full data rows if there are any missing values present. This is how you would do that:

In [133]:
# drop all rows where at least 1 column of data is NaN
lessBowlData = bowlData.dropna(how='any')
lessBowlData

Unnamed: 0,id,year,date,day,winner,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,most_valuable_player,sponsor,bowl_name
0,1,2021,12/29/2021,Wed,Oklahoma,14.0,47,Oregon,15.0,32,59121.0,"Oklahoma RB Kennedy Brooks, Oklahoma S Pat Fields",Valero,Alamo Bowl
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl
4,5,2017,12/28/2017,Thu,Texas Christian,13.0,39,Stanford,15.0,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl
5,6,2016,12/29/2016,Thu,Oklahoma State,13.0,38,Colorado,11.0,8,59815.0,"OSU WR James Washington, OSU DT Vincent Taylor",Valero,Alamo Bowl
6,7,2015,1/2/2016,Sat,Texas Christian,11.0,47,Oregon,15.0,41,64569.0,"TCU QB Brian Kohlhausen, TCU S Travin Howard",Valero,Alamo Bowl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1427,1428,1996,12/30/1996,Mon,Colorado,8.0,33,Washington,13.0,21,54749.0,"QB Koy Detmer (Colorado), DL Nick Ziegler (Col...",Plymouth,Holiday Bowl
1429,1430,1994,12/30/1994,Fri,Michigan,20.0,24,Colorado State,10.0,14,59453.0,"QB Todd Collins (Michigan), QB Anthoney Hill (...",Thrifty Car Rental,Holiday Bowl
1434,1435,1989,12/29/1989,Fri,Penn State,18.0,50,Brigham Young,19.0,39,61113.0,"RB Blair Thomas (Penn State), QB Ty Detmer (Br...",SeaWorld,Holiday Bowl
1435,1436,1988,12/30/1988,Fri,Oklahoma State,12.0,62,Wyoming,15.0,14,60718.0,"RB Barry Sanders (Oklahoma State), LB Sim Drai...",SeaWorld,Holiday Bowl


When we first loaded bowlData, we had 1527 rows of data. After dropping every row that has at least one missing value, we are left with only 269 rows.

If the goal is instead to drop only the rows where every column is missing, you could use ```.dropna(how='all')```.
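Here is a small sketch comparing the two ```how``` options on made-up data. It also shows the ```subset``` parameter, which limits the check to the columns you name.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: the last row is missing in every column
toy = pd.DataFrame({
    "winner_rank": [14.0, np.nan, np.nan],
    "attendance": [59121.0, 10822.0, np.nan],
})

any_dropped = toy.dropna(how="any")    # keeps only rows with no missing values
all_dropped = toy.dropna(how="all")    # drops only rows that are entirely missing
subset_dropped = toy.dropna(subset=["attendance"])  # only checks attendance

print(len(any_dropped), len(all_dropped), len(subset_dropped))
```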

## Creating New Columns Derived from Existing Columns

Pandas allows us to easily use existing columns to calculate new data and save the calculations into a new column in the dataframe. 

Here's an example. Let's define a blowout as when the winning team beats the losing team by 21 or more points. The task now is to create a new column in our dataframe that indicates whether each game (row) in our dataset was a blowout. We'll fill the column with values of True or False.

In [134]:
bowlData['blowout'] = (bowlData.winner_points - bowlData.loser_points) >= 21
bowlData

Unnamed: 0,id,year,date,day,winner,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,most_valuable_player,sponsor,bowl_name,blowout
0,1,2021,12/29/2021,Wed,Oklahoma,14.0,47,Oregon,15.0,32,59121.0,"Oklahoma RB Kennedy Brooks, Oklahoma S Pat Fields",Valero,Alamo Bowl,False
1,2,2020,12/29/2020,Tue,Texas,20.0,55,Colorado,,23,10822.0,"Texas RB Bijan Robinson, Texas LB DeMarvion Ov...",Valero,Alamo Bowl,True
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12.0,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl,True
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl,False
4,5,2017,12/28/2017,Thu,Texas Christian,13.0,39,Stanford,15.0,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl,False
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl,False
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl,False
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20.0,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl,False


It's that simple! No looping is required. Pandas performs the subtraction and the comparison on entire columns at once and places each result in the matching row. Writing an explicit loop over the rows would be much slower. The same is true for many Python data packages: it's most efficient to use the built-in capabilities of whatever package you're working with, so try to avoid looping wherever you can.
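For comparison, this sketch computes the blowout flag both ways on a toy dataframe holding the scores of the first four games shown above. The loop is included only to show that it gives the same answer with more code.

```python
import pandas as pd

scores = pd.DataFrame({
    "winner_points": [47, 55, 38, 28],
    "loser_points": [32, 23, 10, 26],
})

# column-wise version: one expression covers every row
vectorized = (scores["winner_points"] - scores["loser_points"]) >= 21

# loop version: same logic, one row at a time
looped = []
for winner, loser in zip(scores["winner_points"], scores["loser_points"]):
    looped.append(bool(winner - loser >= 21))

print(vectorized.tolist())
print(looped)
```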

<div class="alert alert-info"> 
    
### Exercise 6: Copy an existing column to a new column

Modify the dataframe bowlData by copying the year column to a new column called year_copy. Print bowlData to the screen to check your work.
</div>

In [135]:
# add your code here


## Merging DataFrames Together

Let's pretend I have additional college football bowl data hanging out in a separate csv file. The additional data uses the same 'id' values (the row identifier) as collegefootballbowl.csv, but not all ids (1-1527) are present. How do I join this additional data to the ```bowlData``` dataframe?

In [136]:
# load the data
moreBowlData = pd.read_csv('data/morecollegefootballbowldata.csv')
moreBowlData

Unnamed: 0,id,tickets_sold,best_selling_concession
0,1,86159,hot dog
1,2,100010,nachos
2,5,45924,french fries


Oh wow, that's not much data! But let's merge it into ```bowlData``` anyway. The merge column will be the id column since that is the only common column between the two data files.

In [137]:
allBowlData = pd.merge(bowlData, moreBowlData, how = 'outer', on = 'id')
allBowlData

Unnamed: 0,id,year,date,day,winner,winner_rank,winner_points,loser_tie,loser_rank,loser_points,attendance,most_valuable_player,sponsor,bowl_name,blowout,tickets_sold,best_selling_concession
0,1,2021,12/29/2021,Wed,Oklahoma,14.0,47,Oregon,15.0,32,59121.0,"Oklahoma RB Kennedy Brooks, Oklahoma S Pat Fields",Valero,Alamo Bowl,False,86159.0,hot dog
1,2,2020,12/29/2020,Tue,Texas,20.0,55,Colorado,,23,10822.0,"Texas RB Bijan Robinson, Texas LB DeMarvion Ov...",Valero,Alamo Bowl,True,100010.0,nachos
2,3,2019,12/31/2019,Tue,Texas,,38,Utah,12.0,10,60147.0,"Texas QB Sam Ehlinger, Texas LB Joseph Ossai",Valero,Alamo Bowl,True,,
3,4,2018,12/28/2018,Fri,Washington State,12.0,28,Iowa State,25.0,26,60675.0,"Washington State QB Gardner Minshew, Washingto...",Valero,Alamo Bowl,False,,
4,5,2017,12/28/2017,Thu,Texas Christian,13.0,39,Stanford,15.0,37,57653.0,"TCU QB Kenny Hill, TCU LB Travin Howard",Valero,Alamo Bowl,False,45924.0,french fries
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1522,1523,2004,12/30/2004,Thu,Northern Illinois,,34,Troy,,21,21456.0,"RB Dewhitt Betterson (Troy), DB Lionel Hickenb...",,Silicon Valley Bowl,False,,
1523,1524,2003,12/30/2003,Tue,Fresno State,,17,UCLA,,9,20126.0,"RB Rodney Davis (Fresno State), DL Garrett McI...",,Silicon Valley Bowl,False,,
1524,1525,2002,12/31/2002,Tue,Fresno State,,30,Georgia Tech,,21,10132.0,"RB Rodney Davis (Fresno State), DL Jason Stewa...",,Silicon Valley Bowl,False,,
1525,1526,2001,12/31/2001,Mon,Michigan State,,44,Fresno State,20.0,35,30456.0,"WR Charles Rogers (Michigan State), DL Nick My...",,Silicon Valley Bowl,False,,


We can see that for the 3 ids where we have tickets_sold and best_selling_concession data, those values now appear in the merged dataframe. Everywhere else, those two columns were filled with NaN. There are many ways to merge dataframes. Check out the [pd.merge() API reference](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) for more information.
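As a rough illustration of how the ```how``` parameter changes the result, the sketch below merges two tiny made-up dataframes three ways. The ```indicator=True``` option adds a ```_merge``` column showing which input each row came from.

```python
import pandas as pd

# hypothetical toy data: id 2 exists only on the left, id 4 only on the right
left = pd.DataFrame({"id": [1, 2, 3],
                     "bowl_name": ["Alamo Bowl", "Alamo Bowl", "Holiday Bowl"]})
right = pd.DataFrame({"id": [1, 3, 4],
                      "tickets_sold": [86159, 45924, 100010]})

outer = pd.merge(left, right, how="outer", on="id", indicator=True)  # all ids
inner = pd.merge(left, right, how="inner", on="id")  # ids present in both
left_join = pd.merge(left, right, how="left", on="id")  # every id from left

print(outer)
print(len(inner), len(left_join))
```

When every id in the second dataframe also exists in the first, an outer merge and a left merge return the same rows.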

# VII. Writing a DataFrame to a File

Pandas dataframes have a method ```.to_csv()``` for writing a dataframe to a .csv file.

Let's write our bowl data to a file.

In [None]:
bowlData.to_csv(r'data/bowlData.csv', index=False, header=True)

Locate the file you just wrote, open it, and see what it looks like. If you open the file in JupyterLab it will look like a spreadsheet. If you open the file with any other text editor, you should see the comma separated header and data values.
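The ```index``` parameter controls whether the row labels are written as an extra first column. The sketch below uses a made-up dataframe and leaves out the file path, in which case ```.to_csv()``` returns the CSV text as a string so we can inspect it without touching the disk.

```python
import io
import pandas as pd

# hypothetical toy data for illustration
toy = pd.DataFrame({"id": [1, 2], "bowl_name": ["Alamo Bowl", "Holiday Bowl"]})

with_index = toy.to_csv()                # row labels written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)

# reading the text back gives the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```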

# VIII. Pandas Datetimes

Datetimes are a special type of object that represents a date and a time. Python has a built-in datetime data type that we did not previously cover because Pandas' datetimes offer much more functionality. In this section we'll cover the Pandas datetime objects ```Timestamp``` and ```DatetimeIndex```.

This section covers:
- creating single and multiple datetimes from strings with ```pd.to_datetime()```
- creating a date range
- extracting parts of a datetime
- math with dates (adding, differencing, comparing)
- time-indexed data
- resampling
- groupby with dates

## Creating Datetimes with Pandas

The string format of a datetime looks like ```YYYY-MM-DD hh:mm:ss.ns```. The date part (years, months, and days) comes before the space. The time part (hours, minutes, seconds, and fractional seconds) comes after the space. The highest precision of a Pandas datetime object is nanoseconds, but not all datetimes need to be that precise.

Pandas stores single datetimes as Timestamp objects and sequences of datetimes as DatetimeIndex objects. We can create a Timestamp object either with ```pd.Timestamp()``` or ```pd.to_datetime()```. We can create a DatetimeIndex object either with ```pd.to_datetime()``` or ```pd.date_range()```.

Let's create our first datetimes. The Pandas functions for creating Timestamps can accept date inputs in a range of formats. 

<!-- by specifying the dtype parameter as ```datetime64[precision]``` where precision can equal `Y`, `M`, `D`, `h`, `m`, `s`, `ms`. A full list of all possible precision units can be found in the [NumPy API Reference for Datetimes](https://numpy.org/doc/stable/reference/arrays.datetime.html#arrays-dtypes-dateunits).
 -->


In [285]:
# convert an individual date string into a Pandas Timestamp object
# all of these will result in the same Timestamp object

print('pd.to_datetime()')
print(pd.to_datetime('2025-01-01')) 
print(pd.to_datetime('2025/01/01')) 
print(pd.to_datetime('1/1/2025')) 
print(pd.to_datetime('2025.01.01')) 
print(pd.to_datetime('Jan 1, 2025')) 
print(pd.to_datetime('20250101')) 

print('pd.Timestamp()')
print(pd.Timestamp('2025-01-01')) 
print(pd.Timestamp('2025/01/01')) 
print(pd.Timestamp('1/1/2025')) 
print(pd.Timestamp('2025.01.01')) 
print(pd.Timestamp('Jan 1, 2025')) 
print(pd.Timestamp('20250101')) 
print(pd.Timestamp(2025,1,1)) 

pd.to_datetime()
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
pd.Timestamp()
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00
2025-01-01 00:00:00


There are some differences between these two functions in what types of date inputs can be accepted though. See the Pandas API Reference for [```pd.Timestamp()```](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html) and [```pd.to_datetime()```](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) for details.
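As one example of what ```pd.to_datetime()``` offers, it has a ```format``` parameter for spelling out how an ambiguous date string should be read, and an ```errors``` parameter for handling values that cannot be parsed. The date strings below are made up for illustration.

```python
import pandas as pd

# explicit format: day/month/year, so this is 2 January, not 1 February
parsed = pd.to_datetime("02/01/2025", format="%d/%m/%Y")
print(parsed)

# errors="coerce" turns unparseable entries into NaT instead of raising an error
mixed = pd.to_datetime(["2025-01-15", "not a date"], errors="coerce")
print(mixed)
```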

Notice that Timestamp objects always display as ```YYYY-MM-DD hh:mm:ss```. If you don't provide hours, minutes, and seconds, they default to zero. If you leave off the month or day, Pandas defaults to the first month and first day.

In [54]:
print(pd.Timestamp('2025-01')) # precision = month
print(pd.Timestamp('2025')) # precision = year

2025-01-01 00:00:00
2025-01-01 00:00:00


Inputting a sequence of dates into ```pd.to_datetime()``` will return a DatetimeIndex object containing data of type datetime64. The ```[ns]``` next to the data type below indicates the precision of the datetime is nanoseconds.

In [64]:
dates = ['2025-01-15','2025-03-12','2025-10-02','2025-02-28','2025-12-20']
pd.to_datetime(dates)

DatetimeIndex(['2025-01-15', '2025-03-12', '2025-10-02', '2025-02-28',
               '2025-12-20'],
              dtype='datetime64[ns]', freq=None)

We can create a sequence of datetimes from a starting point to an ending point with the ```pd.date_range()``` function. The frequency parameter lets us easily create sequences of datetimes spaced at intervals. 

Below we create DatetimeIndex objects with daily, monthly, and 6-hourly frequency. All the available frequency options are listed in the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases). 

In [69]:
# create daily datetimes (D = daily)
print(pd.date_range('2025-01-01','2025-01-31',freq='D')) 

# create monthly datetimes at the start of each month (MS = month start)
print(pd.date_range('2025-01-01','2025-12-31',freq='MS'))

# create hourly datetimes every 6 hours (6h = 6-hourly)
print(pd.date_range('2025-01-01 00', '2025-01-01 18',freq='6h'))

DatetimeIndex(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04',
               '2025-01-05', '2025-01-06', '2025-01-07', '2025-01-08',
               '2025-01-09', '2025-01-10', '2025-01-11', '2025-01-12',
               '2025-01-13', '2025-01-14', '2025-01-15', '2025-01-16',
               '2025-01-17', '2025-01-18', '2025-01-19', '2025-01-20',
               '2025-01-21', '2025-01-22', '2025-01-23', '2025-01-24',
               '2025-01-25', '2025-01-26', '2025-01-27', '2025-01-28',
               '2025-01-29', '2025-01-30', '2025-01-31'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01',
               '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01',
               '2025-09-01', '2025-10-01', '2025-11-01', '2025-12-01'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2025-01-01 00:00:00', '2025-01-01 06:00:00',
               '2025-01-01 12:00:00', '2025-01-01 18:00:00'],
           

As we see above with the 6-hourly frequency, an integer can be added to the frequency string to create datetimes spaced at multiples of hours (or minutes, days, months, etc).
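Two more variations, sketched with made-up dates: a multiple of days, and the ```periods``` parameter, which sets how many datetimes to create instead of giving an end point.

```python
import pandas as pd

# every second day between two end points
every_other_day = pd.date_range("2025-01-01", "2025-01-09", freq="2D")
print(every_other_day)

# four datetimes spaced 6 hours apart, starting at midnight
four_steps = pd.date_range("2025-01-01", periods=4, freq="6h")
print(four_steps)
```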

## Datetime Properties

Datetimes have components and properties that we can access as attributes directly on the datetime object. The full list of datetime properties is located in the [Pandas User Guide section on timeseries](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components).

In [263]:
# a Timestamp object for demonstration
one_timestamp = pd.to_datetime('2025-05-01')
print(one_timestamp)

# a DatetimeIndex object for demonstration
datetime_index = pd.date_range('2025-01-01','2025-12-31',freq='MS')
print(datetime_index)

2025-05-01 00:00:00
DatetimeIndex(['2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01',
               '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01',
               '2025-09-01', '2025-10-01', '2025-11-01', '2025-12-01'],
              dtype='datetime64[ns]', freq='MS')


In [264]:
# accessing properties of a Timestamp object
print(one_timestamp)
print('-------------------')
print('year', one_timestamp.year)
print('month', one_timestamp.month)
print('day', one_timestamp.day)
print('hour', one_timestamp.hour)
print('day of year', one_timestamp.dayofyear)
print('day of week', one_timestamp.dayofweek)
print('quarter', one_timestamp.quarter)

2025-05-01 00:00:00
-------------------
year 2025
month 5
day 1
hour 0
day of year 121
day of week 3
quarter 2


In [265]:
# accessing properties of a DatetimeIndex object
print(datetime_index)
print('--------------------------------------------------------------------------------')
print('year', datetime_index.year)
print('month', datetime_index.month)
print('day', datetime_index.day)
print('hour', datetime_index.hour)
print('day of year', datetime_index.dayofyear)
print('day of week', datetime_index.dayofweek)
print('quarter', datetime_index.quarter)

DatetimeIndex(['2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01',
               '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01',
               '2025-09-01', '2025-10-01', '2025-11-01', '2025-12-01'],
              dtype='datetime64[ns]', freq='MS')
--------------------------------------------------------------------------------
year Index([2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025], dtype='int32')
month Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype='int32')
day Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype='int32')
hour Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype='int32')
day of year Index([1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335], dtype='int32')
day of week Index([2, 5, 5, 1, 3, 6, 1, 4, 0, 2, 5, 0], dtype='int32')
quarter Index([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype='int32')


Timestamp and DatetimeIndex objects can also live inside Pandas series and dataframe structures. Datetimes could be a column of your dataframe (a series), or you could use a DatetimeIndex to time-index your dataframe (which we'll cover in a later subsection). If your datetimes are a series in a dataframe, we can access their properties through the ```.dt``` accessor.

In [174]:
# make DatetimeIndex a series in a dataframe for demonstration
df = pd.DataFrame(datetime_index,columns=['DATE'])
df.head()

Unnamed: 0,DATE
0,2025-01-01
1,2025-02-01
2,2025-03-01
3,2025-04-01
4,2025-05-01


In [175]:
# access components of datetimes when they are stored in a series
# a series is returned
df.DATE.dt.year

0     2025
1     2025
2     2025
3     2025
4     2025
5     2025
6     2025
7     2025
8     2025
9     2025
10    2025
11    2025
Name: DATE, dtype: int32
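The ```.dt``` accessor exposes more than ```.year```. As a small sketch on a made-up series of dates, here are the month number, the weekday name, and a formatted text label.

```python
import pandas as pd

# a series of datetimes for illustration
dates = pd.Series(pd.to_datetime(["2025-01-15", "2025-03-12", "2025-10-02"]))

months = dates.dt.month             # integer month of each date
weekdays = dates.dt.day_name()      # weekday name of each date
labels = dates.dt.strftime("%Y-%m") # dates formatted as year-month strings

print(months.tolist())
print(weekdays.tolist())
print(labels.tolist())
```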

## Timedeltas for Math with Dates

Timedelta objects (data type timedelta64) are differences between datetimes, expressed in units such as days, hours, minutes, and seconds. They can be positive or negative.

We can use timedeltas to add or subtract a fixed amount of time from a datetime.

In [275]:
# create timedelta object
delta = pd.Timedelta('6h')
delta

Timedelta('0 days 06:00:00')

In [276]:
# add timedelta to datetimes
datetime_index + delta

DatetimeIndex(['2025-01-01 06:00:00', '2025-02-01 06:00:00',
               '2025-03-01 06:00:00', '2025-04-01 06:00:00',
               '2025-05-01 06:00:00', '2025-06-01 06:00:00',
               '2025-07-01 06:00:00', '2025-08-01 06:00:00',
               '2025-09-01 06:00:00', '2025-10-01 06:00:00',
               '2025-11-01 06:00:00', '2025-12-01 06:00:00'],
              dtype='datetime64[ns]', freq=None)

In [277]:
# subtract timedelta from datetimes
datetime_index - delta

DatetimeIndex(['2024-12-31 18:00:00', '2025-01-31 18:00:00',
               '2025-02-28 18:00:00', '2025-03-31 18:00:00',
               '2025-04-30 18:00:00', '2025-05-31 18:00:00',
               '2025-06-30 18:00:00', '2025-07-31 18:00:00',
               '2025-08-31 18:00:00', '2025-09-30 18:00:00',
               '2025-10-31 18:00:00', '2025-11-30 18:00:00'],
              dtype='datetime64[ns]', freq=None)

Notice how easy it was to subtract 6 hours from these datetimes. If we kept our dates as strings instead of datetimes, this task would require us to write a significant amount of code. Datetimes and timedeltas allow us to use simple addition and subtraction instead of having to code up something much more complicated.
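Datetimes can also be subtracted from and compared with each other. The sketch below uses two made-up Timestamps; subtracting them gives a Timedelta, and the comparison operators work as they do for numbers.

```python
import pandas as pd

start = pd.Timestamp("2025-01-01")
end = pd.Timestamp("2025-03-01 12:00")

# subtracting two Timestamps gives a Timedelta
elapsed = end - start
print(elapsed)

# Timestamps and Timedeltas can be compared like numbers
is_later = end > start
within_90_days = elapsed < pd.Timedelta(days=90)
print(is_later, within_90_days)
```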

## Working with Time-Indexed Data

One of the most powerful applications of Pandas datetimes is for time-indexed data. This means using the DatetimeIndex object as the index in a dataframe. This is appropriate for data recorded over time, where the time of each observation is meaningful, such as daily observations of tornado occurrences or any other timeseries.

For this section we'll work with daily severe weather counts (tornados, severe wind, and severe hail) for the state of Mississippi. This data was obtained for the years 2004-2023 from the [NOAA National Weather Service Storm Prediction Center website](https://www.spc.noaa.gov/climo/summary/) and compiled into a single data file ```data/NOAA_SevereWeather/NOAA-SPC_SevereWxCounts_MS_2004-2023.csv``` for our use here.

Some of the things we can do with time-indexed data like the severe weather counts are:
- resampling in time, e.g., daily counts --> annual counts
- grouping in time, e.g., find the long-term average number of tornados that occur in each month
- math with dates, e.g., find the number of days between tornado occurrences

First, we'll load the data into a dataframe. We can tell Pandas to create datetimes through the use of parameters in the ```pd.read_csv()``` function. To use the "Date" column of data as the index of the dataframe we can use the parameter ```index_col='Date'```. To turn the dates into datetimes (a DatetimeIndex) we can use the parameter ```parse_dates=['Date']```.

In [281]:
# read data into dataframe, converting dates to datetimes and using them as the df index
wx_df = pd.read_csv('data/NOAA_SevereWeather/NOAA-SPC_SevereWxCounts_MS_2004-2023.csv', 
                 usecols=['Date','Tornado','Wind','Hail'], 
                 parse_dates = ['Date'],
                 index_col='Date')
wx_df

Unnamed: 0_level_0,Tornado,Wind,Hail
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2004-01-01,0,0,0
2004-01-02,0,0,0
2004-01-03,0,0,0
2004-01-04,0,0,0
2004-01-05,0,0,0
...,...,...,...
2023-12-27,0,0,0
2023-12-28,0,0,0
2023-12-29,0,0,0
2023-12-30,0,0,0


There are 20 years of data: 20 * 365 days + 5 leap days = 7305 rows of data.

All the data columns are counts of severe weather occurrences, so if there are no messy data values then Pandas should have assigned those columns an integer data type. Let's double check the type of the data columns and index.

In [282]:
# look at data types
print(wx_df.dtypes)
type(wx_df.index)

Tornado    int64
Wind       int64
Hail       int64
dtype: object


pandas.core.indexes.datetimes.DatetimeIndex

We now have a time-indexed dataframe of daily counts of severe weather occurrences in the state of Mississippi from 2004-2023. 

### Resampling in Time

Because the index of our dataframe is datetimes, we can easily resample the data in time, e.g., daily counts --> annual counts.

Pandas dataframe ```.resample()``` works similarly to ```.groupby()```. Where ```.groupby()``` groups rows by the values in a column, ```.resample()``` groups rows into time intervals along a DatetimeIndex, specified with a frequency alias. Here, we use the alias ```'A'``` for annual. Note that pandas 2.2 and later deprecate ```'A'``` in favor of ```'YE'``` (year end), so you may see a warning depending on your installed version. The list of resampling alias options can be found in the [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#period-aliases).

In [283]:
# resample daily to annual counts
df_annual = wx_df.resample('A').sum()
df_annual

Unnamed: 0_level_0,Tornado,Wind,Hail
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2004-12-31,55,403,153
2005-12-31,112,466,448
2006-12-31,33,401,344
2007-12-31,33,450,223
2008-12-31,111,826,469
2009-12-31,46,521,323
2010-12-31,45,478,300
2011-12-31,87,661,368
2012-12-31,48,486,202
2013-12-31,30,255,146


Notice that just like with ```.groupby()``` we need to also apply some sort of mathematical function like ```.sum()```.
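The function we apply does not have to be ```.sum()```. As a sketch, here are made-up 6-hourly wind counts resampled to daily totals and daily maximums.

```python
import pandas as pd

# hypothetical 6-hourly counts covering two days
times = pd.date_range("2025-01-01", periods=8, freq="6h")
counts = pd.DataFrame({"Wind": [0, 2, 1, 0, 3, 0, 0, 1]}, index=times)

daily_total = counts.resample("D").sum()  # total count per day
daily_max = counts.resample("D").max()    # largest 6-hourly count per day

print(daily_total)
print(daily_max)
```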

<div class="alert alert-info"> 

### Exercise 7: Resample Data Using Datetimes

Resample ```wx_df``` to obtain a dataframe where each row contains the sum of one month of each type of severe weather observation (your result should retain the columns Tornado, Wind, and Hail).
</div>

In [284]:
# add your code here

wx_df.resample('M').sum()

Unnamed: 0_level_0,Tornado,Wind,Hail
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2004-01-31,0,1,0
2004-02-29,3,17,15
2004-03-31,1,17,7
2004-04-30,2,17,35
2004-05-31,6,47,12
...,...,...,...
2023-08-31,0,25,2
2023-09-30,0,40,6
2023-10-31,0,0,0
2023-11-30,0,12,1


### Grouping in Time

We can also use datetimes to easily group in time to calculate things like the long-term average number of occurrences of each type of severe weather per month of the year.

First, we'll programmatically get the number of years in the dataset by accessing the ```.year``` property of the DatetimeIndex and then using the pandas function ```.nunique()``` to count the unique years in the data (the total number of years).

In [273]:
# programmatically determine number of years in the dataset
nyears = wx_df.index.year.nunique()
nyears

20

Now, we can group the entire dataset by month and divide by the total number of years in the dataset to get the average number of occurrences of each type of severe weather per month of the year.

In [274]:
# calculate long term monthly means
df_monthly_mean = wx_df.groupby(wx_df.index.month).sum() / nyears
df_monthly_mean

Unnamed: 0_level_0,Tornado,Wind,Hail
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,4.9,27.5,8.65
2,4.45,25.35,14.85
3,8.15,47.3,39.85
4,17.9,86.55,57.95
5,4.0,57.1,32.75
6,1.5,86.0,25.1
7,0.55,48.4,7.95
8,2.5,41.3,7.15
9,4.7,11.25,2.35
10,2.3,15.2,5.3


### Differencing Dates

Another very useful aspect of datetimes is how we can difference them. When we difference two datetimes, the result is a timedelta object. Let's look at an example. We'll difference consecutive dates of tornado occurrences to get the number of days between tornados. First, let's subset our daily severe weather data to only the tornado data and drop all records when there were no tornados.

In [278]:
# new dataframe with only the index and Tornado column
tornados = wx_df[['Tornado']]

# drop all rows with zero tornados
tornados = tornados.query('Tornado != 0')

tornados

Unnamed: 0_level_0,Tornado
Date,Unnamed: 1_level_1
2004-02-05,3
2004-03-05,1
2004-04-29,2
2004-05-01,4
2004-05-29,1
...,...
2022-11-29,8
2022-11-30,3
2022-12-13,3
2022-12-14,4


Now we can use the ```.diff()``` function on the dataframe's DatetimeIndex.

In [279]:
tornados['timedelta_since_tornados'] = tornados.index.diff()
print(tornados.dtypes)
tornados

Tornado                               int64
timedelta_since_tornados    timedelta64[ns]
dtype: object


Unnamed: 0_level_0,Tornado,timedelta_since_tornados
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2004-02-05,3,NaT
2004-03-05,1,29 days
2004-04-29,2,55 days
2004-05-01,4,2 days
2004-05-29,1,28 days
...,...,...
2022-11-29,8,31 days
2022-11-30,3,1 days
2022-12-13,3,13 days
2022-12-14,4,1 days


Pandas executes ```.diff()``` on the dataframe index as ```diff[i] = index[i] - index[i-1]``` (subtracting the previous index value). That's why the first result is the missing value NaT which stands for "Not a Time", the datetime equivalent of NaN.
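The same calculation can be sketched on three of the dates shown above. This version converts the index to a series first and uses the series ```.diff()```, which is an alternative worth knowing in case your installed pandas does not provide ```.diff()``` on an index (an assumption about older releases, so check your version).

```python
import pandas as pd

event_dates = pd.to_datetime(["2004-02-05", "2004-03-05", "2004-04-29"])

# each value is the current date minus the previous date
gaps = event_dates.to_series().diff()
print(gaps)
```

The first gap is NaT because there is no earlier date to subtract.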

Notice that the "timedelta_since_tornados" column has data type timedelta64. To convert timedelta objects in a series to a numerical data type we can use the ```.dt``` accessor. ```.dt.days``` will pull out the day component of each timedelta. Because the first value is NaT, the resulting column is stored as floats, with NaN in the first row. The list of attributes you can access from timedeltas can be found in the [Pandas API Reference for ```pd.Timedelta()```](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html).

In [280]:
tornados['days_since_tornados'] = tornados['timedelta_since_tornados'].dt.days
print(tornados.dtypes)
tornados

Tornado                               int64
timedelta_since_tornados    timedelta64[ns]
days_since_tornados                 float64
dtype: object


Date,Tornado,timedelta_since_tornados,days_since_tornados
2004-02-05,3,NaT,
2004-03-05,1,29 days,29.0
2004-04-29,2,55 days,55.0
2004-05-01,4,2 days,2.0
2004-05-29,1,28 days,28.0
...,...,...,...
2022-11-29,8,31 days,31.0
2022-11-30,3,1 days,1.0
2022-12-13,3,13 days,13.0
2022-12-14,4,1 days,1.0
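
You may have noticed that "days_since_tornados" came out as float64 rather than an integer type. That is because the NaT in the first row becomes NaN, and a regular int64 column cannot hold NaN. If whole-number days are preferred, the sketch below shows two possible approaches on a small hypothetical series (```deltas``` and the other variable names are made up for illustration).

```python
import pandas as pd

# Hypothetical timedeltas built by differencing a few dates;
# the first value is NaT, as in the table above
dates = pd.Series(pd.to_datetime(['2004-02-05', '2004-03-05',
                                  '2004-04-29', '2004-05-01']))
deltas = dates.diff()

days = deltas.dt.days
print(days.dtype)  # float64, because NaT becomes NaN

# Option 1: the nullable integer type keeps whole numbers plus a missing marker
days_nullable = days.astype('Int64')
print(days_nullable)

# Option 2: drop the missing first row, then convert to a regular integer
days_int = days.dropna().astype('int64')
print(days_int)
```

Option 1 keeps every row, while option 2 shortens the series by one, so which one fits depends on whether the first tornado date still needs to appear in the result.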


<div class="alert alert-info"> 
    
# IX. Exercise: Putting it All Together

Use Pandas to read, clean, manipulate, and aggregate weather observations.

## Read the data file
Read the file at data/weatherdata.csv into a Pandas dataframe and render the dataframe to the screen. Don't specify any parameters besides the filename when reading the file into a dataframe.
</div>

In [236]:
# add your code here


<div class="alert alert-info"> 

## Clean the data

Look at all the columns of data. Do you see any mistakes?

Replace 'LosAngeles' at index 3 with 'Los Angeles'.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
Show the data type of each column. 
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
What data type is the windspeed_knots column? 
</div>

Type your answer:

<div class="alert alert-info">
Why did Pandas assign that data type to windspeed_knots?
</div>

Type your answer:

<div class="alert alert-info">
Judging by the data values in the windspeed_knots column, what data type should windspeed_knots probably be?
</div>

Type your answer:

<div class="alert alert-info">
Force the windspeed_knots column to be numeric, then show that the data type of the column did in fact change.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
## Create a new column of data

Create a column of boolean data called ```IsRainy``` that indicates whether there was precipitation.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 

## Calculate average temperature by city

Calculate the average temperature for each city and save the result as a new variable called ```avgT```.
</div>

In [None]:
# add your code here


<div class="alert alert-info"> 
    
Looking at the ```df``` dataframe and the ```avgT``` results, how many data values were used to calculate the average New York temperature? 
</div>

Type your answer:

<div class="alert alert-info"> 

## Convert date strings to datetimes

Convert the string values in the dates column to datetimes. You don't need to set the dates column as the dataframe's index; just convert the string data in the dates column to datetimes.
</div>

In [232]:
# add your code here


<div class="alert alert-info"> 
    
## Sort the dataframe by date, ascending

</div>

In [237]:
# add your code here


<div class="alert alert-danger">

**Sidebar about sorting:** If you have dates in your data, it's best to convert them from string values to datetime objects. Remember how we saw earlier what can happen when sorting strings that contain numbers? If we had full months of date strings, sorting would be problematic due to the lexicographic sort order. This problem is avoided completely if you work with dates as datetimes instead of strings. 
</div>
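
As a small illustration of the sidebar above, the sketch below sorts the same three dates first as strings and then as datetimes. The date values and the ```dates``` dataframe are hypothetical and are not taken from the exercise file.

```python
import pandas as pd

# Hypothetical month/day/year date strings
dates = pd.DataFrame({'date': ['2/1/2023', '10/1/2023', '1/15/2023']})

# Strings are compared character by character, so October ('10/...')
# sorts ahead of February ('2/...')
string_sorted = dates.sort_values('date')['date'].tolist()
print(string_sorted)
# ['1/15/2023', '10/1/2023', '2/1/2023']

# After converting to datetimes, the sort is chronological
dates['date'] = pd.to_datetime(dates['date'], format='%m/%d/%Y')
datetime_sorted = dates.sort_values('date')['date'].dt.strftime('%Y-%m-%d').tolist()
print(datetime_sorted)
# ['2023-01-15', '2023-02-01', '2023-10-01']
```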

<div class="alert alert-info"> 
    
## Write the dataframe to file

Write the ```df``` data to a file called ```weatherdata_yourname.csv```, replacing yourname with your first name. Do not include the index column, but do include the column names.
</div>

In [238]:
# add your code here


<div class="alert alert-info"> 

Does your csv file look like this inside?

<img src="images/weatherdata.png" alt="contents of csv file" width="400"/>

If so, congratulations! You've successfully completed this exercise.
</div>

# X. At a Glance: Language Covered

Here is the Pandas functionality we covered in this notebook, at a glance.

## Pandas functions

```pd.DataFrame()```, ```pd.read_csv()```, ```pd.to_numeric()```, ```pd.merge()```, ```pd.to_datetime()```, ```pd.Timestamp()```, ```pd.date_range()```, ```pd.Timedelta()```


## Pandas data structure (dataframe or series) methods 

```.head()```, ```.tail()```, ```.describe()```, ```.info()```, ```.unique()```, ```.query()```, ```.mean()```, ```.median()```, ```.max()```, ```.std()```, ```.sum()```, ```.sort_values()```, ```.groupby()```, ```.idxmax()```, ```.rename()```, ```.isna()```, ```.fillna()```, ```.dropna()```, ```.to_csv()```, ```.resample()```, ```.diff()```, ```.nunique()```

## Pandas data structure (dataframe or series) attributes and accessors

```.shape```, ```.dtypes```, ```.loc```, ```.iloc```, ```.dt```


<div class="alert alert-success">

# XI. Learning More About Pandas

For more about Pandas, start on the Pandas website where you can find:

- a nice cheat sheet https://pandas.pydata.org/docs/getting_started/index.html
- a long list of community developed tutorials https://pandas.pydata.org/docs/getting_started/tutorials.html#communitytutorials
- the user guide, which contains a bunch of 10 minute learning guides as well as more in-depth guides by topic https://pandas.pydata.org/docs/user_guide/index.html
</div>