# Exploring Scouting Data with Pandas

### 1. References
* [Getting Started with Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started)

Pandas is an extensive package that will take time to learn. It's very powerful, which makes the effort worth it.

### 2. Imports
We need two Python modules to work with scouting data in a Jupyter notebook:

In [1]:
import pickle

import pandas as pd

### 3. Read Data from Disk

In [4]:
with open('test_evt2.pickle', 'rb') as file:
    sdata = pickle.load(file)
sdata.keys()

dict_keys(['schedule', 'teams', 'measures'])

In notebook 01, we extracted scouting data from the SQL database, converted it to Pandas dataframes, and saved the dataframes to the hard drive using Python pickle files. The dataframes are contained within a Python dictionary.
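For readers who have not seen notebook 01, here is a minimal sketch of that save-and-load round trip. The dataframes, their contents, and the file name `example.pickle` are made-up stand-ins, not the real scouting data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-ins for the real scouting dataframes.
data = {
    'schedule': pd.DataFrame({'match': ['q1', 'q2']}),
    'teams': pd.DataFrame({'name': ['1318', '2926']}),
}

# Write the dictionary of dataframes to a pickle file in a temp folder.
path = os.path.join(tempfile.mkdtemp(), 'example.pickle')
with open(path, 'wb') as file:
    pickle.dump(data, file)

# Read the dictionary back, just as we did with test_evt2.pickle.
with open(path, 'rb') as file:
    restored = pickle.load(file)

print(list(restored.keys()))
print(restored['teams'].equals(data['teams']))
```

Only unpickle files that come from a source you trust, because loading a pickle file can run arbitrary code.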

We'll be using the unpickled dataframes frequently, so let's give them names that are easy to type:

In [5]:
sched = sdata['schedule']
teams = sdata['teams']
meas = sdata['measures']

### 4. Explore the *teams* dataframe

#### head() and shape
[head() function documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head)

Dataframes are often large, so when first getting familiar with one it helps to view only a few rows of data. The `DataFrame.head()` function is useful for this:

In [6]:
teams.head()

|   | id | name | long_name | city | state | region | year_founded |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 1318 | Issaquah Robotics Society | Issaquah | Washington |  | 2004 |
| 1 | 7254 | 2926 | Robo Sparks | Wapato | Washington |  | 2009 |
| 2 | 1435 | 3070 | Team Pronto | Seattle | Washington |  | 2009 |
| 3 | 5134 | 2990 | Hotwire | Turner | Oregon |  | 2009 |
| 4 | 23 | 4461 | Ramen | Seattle | Washington |  | 2013 |


By default, `head()` returns the first 5 rows. You can also pass in the number of rows you want as a parameter.

In [7]:
teams.head(10)

|   | id | name | long_name | city | state | region | year_founded |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 1318 | Issaquah Robotics Society | Issaquah | Washington |  | 2004.0 |
| 1 | 7254 | 2926 | Robo Sparks | Wapato | Washington |  | 2009.0 |
| 2 | 1435 | 3070 | Team Pronto | Seattle | Washington |  | 2009.0 |
| 3 | 5134 | 2990 | Hotwire | Turner | Oregon |  | 2009.0 |
| 4 | 23 | 4461 | Ramen | Seattle | Washington |  | 2013.0 |
| 5 | 28 | 3393 | Horns of Havoc | Puyallup | Washington |  | 2010.0 |
| 6 | 2809 | na |  |  |  |  |  |
| 7 | 1425 | 4911 | CyberKnights | Seattle | Washington |  | 2014.0 |
| 8 | 4 | 5937 | MI-Robotics | Mercer Island | Washington |  | 2016.0 |
| 9 | 27 | 5683 | Hello World | Auburn | Washington |  | 2015.0 |


Use the shape attribute to determine the size of the dataframe.

In [9]:
teams.shape

(35, 7)

The shape attribute is a two-element tuple. The first element is the number of rows and the second element is the number of columns. How would you extract just the number of rows?
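One possible answer, sketched on a small made-up dataframe rather than the real `teams` data: because `shape` is an ordinary Python tuple, we can index it or unpack it, and `len()` also returns the row count.

```python
import pandas as pd

# Toy dataframe with 3 rows and 2 columns (values are illustrative only).
df = pd.DataFrame({'name': ['1318', '2926', '3070'],
                   'city': ['Issaquah', 'Wapato', 'Seattle']})

# Unpack the tuple into separate row and column counts.
n_rows, n_cols = df.shape

print(df.shape[0])  # first element of the tuple is the row count
print(len(df))      # len() of a dataframe is also the row count
print(n_cols)
```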

### 5. Filtering Dataframes

Which teams were founded in 2010 or earlier?

Before we figure this out, let's convert the year_founded column to numeric values. (The column currently consists of strings.)

In [36]:
teams.year_founded = pd.to_numeric(teams.year_founded, errors='coerce',
                                  downcast='unsigned')
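To see what `errors='coerce'` does, here is a small sketch on a made-up Series of strings (not the real `year_founded` column). Values that cannot be parsed as numbers become `NaN`, and a column that contains `NaN` is stored as floating point, which is probably why the years in some outputs in this notebook appear as `2004.0` rather than `2004`.

```python
import pandas as pd

# Made-up strings: two valid years and one value that is not a number.
years = pd.Series(['2004', '2009', 'na'], dtype=object)

# errors='coerce' turns unparseable values into NaN instead of raising.
converted = pd.to_numeric(years, errors='coerce')

print(converted.tolist())
print(converted.isna().sum())  # number of values that could not be parsed
print(converted.dtype)
```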

In [39]:
teams[teams.year_founded <= 2010]

|   | id | name | long_name | city | state | region | year_founded |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 1318 | Issaquah Robotics Society | Issaquah | Washington |  | 2004.0 |
| 1 | 7254 | 2926 | Robo Sparks | Wapato | Washington |  | 2009.0 |
| 2 | 1435 | 3070 | Team Pronto | Seattle | Washington |  | 2009.0 |
| 3 | 5134 | 2990 | Hotwire | Turner | Oregon |  | 2009.0 |
| 5 | 28 | 3393 | Horns of Havoc | Puyallup | Washington |  | 2010.0 |
| 11 | 36 | 2929 | JAGBOTS | Puyallup | Washington |  | 2009.0 |
| 12 | 33 | 360 | The Revolution | Tacoma | Washington |  | 2000.0 |
| 13 | 2 | 3237 | Event Horizon | Spanaway | Washington |  | 2010.0 |
| 14 | 20 | 2906 | Sentinel Prime Robotics | Spanaway | Washington |  | 2009.0 |
| 17 | 4278 | 948 | NRG (Newport Robotics Group) | Bellevue | Washington |  | 2002.0 |


Why did that work?

When we do a Boolean comparison on a dataframe column, Pandas returns a Series of Boolean values, one for each row:

In [43]:
teams.year_founded <= 2010

0      True
1      True
2      True
3      True
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11     True
12     True
13     True
14     True
15    False
16    False
17     True
18     True
19    False
20    False
21     True
22    False
23    False
24    False
25    False
26     True
27     True
28    False
29    False
30     True
31     True
32    False
33     True
34    False
Name: year_founded, dtype: bool
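The same comparison can be tried on a small made-up dataframe, which makes the result easier to inspect. This sketch assumes nothing about the real data beyond the column name `year_founded`. Note that a missing value (`NaN`) compares as `False`, which would explain why a row with no founding year shows `False` above.

```python
import pandas as pd

# Made-up data: the second team has no founding year.
df = pd.DataFrame({'name': ['360', 'na', '4461'],
                   'year_founded': [2000.0, float('nan'), 2013.0]})

# The comparison produces a Boolean Series with the same index as df.
mask = df.year_founded <= 2010

print(type(mask).__name__)
print(mask.tolist())
print(mask.sum())  # True counts as 1, so this is the number of matches
```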

If a Series (or list) of Booleans is passed into the square brackets after a dataframe, Pandas returns another dataframe containing only the rows where the value is `True`. We can also check whether a column's contents match any of the values in a list:

In [47]:
teams[teams.city.isin(['Seattle', 'Tacoma'])]

|   | id | name | long_name | city | state | region | year_founded |
|---|---|---|---|---|---|---|---|
| 2 | 1435 | 3070 | Team Pronto | Seattle | Washington |  | 2009.0 |
| 4 | 23 | 4461 | Ramen | Seattle | Washington |  | 2013.0 |
| 7 | 1425 | 4911 | CyberKnights | Seattle | Washington |  | 2014.0 |
| 12 | 33 | 360 | The Revolution | Tacoma | Washington |  | 2000.0 |
| 19 | 952 | 6503 | Iron Dragon | Seattle | Washington |  | 2017.0 |
| 25 | 974 | 3684 | Electric Eagles | Seattle | Washington |  | 2011.0 |
| 32 | 4283 | 3574 | HIGH TEKERZ | Seattle | Washington |  | 2011.0 |
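Here is a short sketch of `isin()` filtering on a made-up four-row dataframe, in case it helps to step through it in isolation. The filtered dataframe keeps the original row labels, which is why the index in the output above skips numbers.

```python
import pandas as pd

# Made-up data, loosely modeled on the teams dataframe.
df = pd.DataFrame({'name': ['1318', '3070', '360', '2990'],
                   'city': ['Issaquah', 'Seattle', 'Tacoma', 'Turner']})

# isin() returns True for each row whose city appears in the list.
in_cities = df.city.isin(['Seattle', 'Tacoma'])

# Passing the Boolean Series in square brackets keeps only the True rows.
subset = df[in_cities]

print(in_cities.tolist())
print(subset.index.tolist())  # original row labels are preserved
print(subset['name'].tolist())
```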
