<a href="https://colab.research.google.com/github/prof-rossetti/intro-to-python/blob/master/notes/python/packages/Pandas_Package_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [None]:

#
# SETUP CELL (DOWNLOAD A CSV FILE TO COLAB)
#
# to see the filesystem in colab, click the file folder in the left navbar

# this is the os module. very helpful for managing the filesystem
# reference: https://docs.python.org/3/library/os.html
import os

csv_filepath = "jeter_stats.csv"
print(csv_filepath)

if not os.path.isfile(csv_filepath):
    print("DOWNLOADING DATA...")
    # FYI: this wget command is a terminal command, NOT python
    # ... in colab, we can execute terminal commands by prefixing them with an exclamation point
    # ... students are not responsible for knowing terminal commands like this
    !wget -q https://raw.githubusercontent.com/prof-rossetti/intro-to-python/master/data/jeter_stats.csv 


jeter_stats.csv


# The `pandas` Package



Reference:

  + [Pandas Website](http://pandas.pydata.org/)
  + [Pandas Source](https://github.com/pandas-dev/pandas)
  + [Pandas Docs](http://pandas.pydata.org/pandas-docs/stable/)
  + [Pandas Docs - API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)


The `pandas` package provides capabilities for working with structured data, including spreadsheet-like objects called "DataFrames".





# Pandas Data Frames

The Pandas [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) datatype represents a table of data, like a spreadsheet.






## Creating Data Frames

We can create `DataFrame` objects from CSV files or create them ourselves from eligible data structures, including: a list of lists, a dictionary of lists, and a list of dictionaries.




### From CSV

Use the pandas [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to import a CSV file to a new `DataFrame` object.





In [None]:

from pandas import read_csv

# READING CSV FILES

stats_df = read_csv(csv_filepath) 
print(type(stats_df))
print(stats_df.head())

<class 'pandas.core.frame.DataFrame'>
   year  games  at_bats  runs  hits  walks
0  1995     15       48     5    12      3
1  1996    157      582   104   183     48
2  1997    159      654   116   190     74
3  1998    149      626   127   203     57
4  1999    158      627   134   219     91



### From List of Lists

When creating a dataframe from a list of lists, column names will be numeric by default, unless you set them yourself (during or after initialization).


In [None]:
from pandas import DataFrame

my_list = [
  [1, "A"],
  [2, "B"],
  [3, "C"]
]

df = DataFrame(my_list)
print(df.columns)
print(df)

df.columns = ["id", "grade"] # can set column names after initialization
print(df.columns)
print(df)

df = DataFrame(my_list, columns=["id", "grade"]) # can set column names during initialization
print(df.columns)
print(df)

# alternative variation, if you want to treat a certain column as the index
# but usually leaving the default index is fine
#df.set_index("id", inplace=True)
#print(df.columns)
#print(df)


RangeIndex(start=0, stop=2, step=1)
   0  1
0  1  A
1  2  B
2  3  C
Index(['id', 'grade'], dtype='object')
   id grade
0   1     A
1   2     B
2   3     C
Index(['id', 'grade'], dtype='object')
   id grade
0   1     A
1   2     B
2   3     C




### From Dictionary of Lists

When creating a dataframe from a dictionary of lists, column names will be the dictionary keys. 


In [None]:
# from pandas import DataFrame

my_dict = {
    "id": [1,2,3],
    "grade": ["A", "B", "C"]
}

df = DataFrame(my_dict)
print(df.columns)
print(df)


Index(['id', 'grade'], dtype='object')
   id grade
0   1     A
1   2     B
2   3     C


### From List of Dictionaries (Records)

When creating a dataframe from a list of dictionaries, column names will be the dictionary keys. 





In [None]:
# from pandas import DataFrame

my_records = [
    {"id": 1, "grade":"A"},
    {"id": 2, "grade":"B"},
    {"id": 3, "grade":"C"}
]

df = DataFrame(my_records)
print(df.columns)
print(df)

Index(['id', 'grade'], dtype='object')
   id grade
0   1     A
1   2     B
2   3     C
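
The three approaches above are interchangeable ways of describing the same table. As a minimal sketch (the variable names `from_lists`, `from_dict`, and `from_records` are made up for illustration), we can build the same three rows each way and check that the resulting dataframes match:

```python
from pandas import DataFrame

# the same three rows, expressed in each of the three supported shapes
from_lists = DataFrame([[1, "A"], [2, "B"], [3, "C"]], columns=["id", "grade"])
from_dict = DataFrame({"id": [1, 2, 3], "grade": ["A", "B", "C"]})
from_records = DataFrame([
    {"id": 1, "grade": "A"},
    {"id": 2, "grade": "B"},
    {"id": 3, "grade": "C"},
])

# equals() checks that two dataframes have the same shape and the same elements
print(from_lists.equals(from_dict))
print(from_dict.equals(from_records))
```

Which shape to use mostly depends on how the data arrives: rows from a file or API response tend to look like records, while column-oriented data tends to look like a dictionary of lists.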


## Using Data Frames



### Row Operations

References:
  + [`DataFrame.iloc[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)
  + [`DataFrame.iterrows()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html)




In [None]:

print("--------------------------")
print("INSPECT FIRST/LAST X ROWS:")
print(stats_df.head(3))
print(stats_df.tail(3))

print("--------------------------")
print("COUNTING ROWS:")
print(len(stats_df))

print("--------------------------")
print("LOOPING THROUGH ROWS:")
for index, row in stats_df.iterrows():
    print(row["year"])

print("--------------------------")
print("REFERENCE A SPECIFIC ROW:")
print(stats_df.iloc[0])

print("--------------------------")
print("CONVERT ROW TO DICTIONARY:")
print(stats_df.iloc[0].to_dict())


--------------------------
INSPECT FIRST/LAST X ROWS:
   year  games  at_bats  runs  hits  walks
0  1995     15       48     5    12      3
1  1996    157      582   104   183     48
2  1997    159      654   116   190     74
    year  games  at_bats  runs  hits  walks
17  2012    159      683    99   216     45
18  2013     17       63     8    12      8
19  2014    145      581    47   149     35
--------------------------
COUNTING ROWS:
20
--------------------------
LOOPING THROUGH ROWS:
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
--------------------------
REFERENCE A SPECIFIC ROW:
year       1995
games        15
at_bats      48
runs          5
hits         12
walks         3
Name: 0, dtype: int64
--------------------------
CONVERT ROW TO DICTIONARY:
{'year': 1995, 'games': 15, 'at_bats': 48, 'runs': 5, 'hits': 12, 'walks': 3}
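
Besides a single position like `iloc[0]`, `iloc` also accepts negative positions and slices. The sketch below illustrates this on a small stand-in dataframe (`sample_df` is a made-up name, holding the first three rows printed above), so it runs without the CSV file:

```python
from pandas import DataFrame

# a small stand-in for stats_df, using the first three rows shown above
sample_df = DataFrame([
    {"year": 1995, "games": 15, "hits": 12},
    {"year": 1996, "games": 157, "hits": 183},
    {"year": 1997, "games": 159, "hits": 190},
])

# iloc accepts a single position, a negative position, or a slice of positions
first_row = sample_df.iloc[0]    # a Series
last_row = sample_df.iloc[-1]    # a Series
first_two = sample_df.iloc[0:2]  # a DataFrame (the end position is excluded)

print(first_row["year"], last_row["year"], len(first_two))

# iterrows() yields (index, row) pairs, where each row is a Series
years = [int(row["year"]) for index, row in sample_df.iterrows()]
print(years)
```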


### Column (`Series`) Operations 

A [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) object represents a column in the dataframe. Each series has as many values as the dataframe has rows.

In [None]:
# selecting multiple columns yields a dataframe object
print("--------------------")
print("SELECTING MANY COLUMNS:")
print(type(stats_df[["games", "year"]]))
print(stats_df[["games", "year"]])

# selecting one column yields a series object
print("--------------------")
print("SELECTING ONE COLUMN:")
print(type(stats_df["games"]))
print(stats_df["games"])

--------------------
SELECTING MANY COLUMNS:
<class 'pandas.core.frame.DataFrame'>
    games  year
0      15  1995
1     157  1996
2     159  1997
3     149  1998
4     158  1999
5     148  2000
6     150  2001
7     157  2002
8     119  2003
9     154  2004
10    159  2005
11    154  2006
12    156  2007
13    150  2008
14    153  2009
15    157  2010
16    131  2011
17    159  2012
18     17  2013
19    145  2014
--------------------
SELECTING ONE COLUMN:
<class 'pandas.core.series.Series'>
0      15
1     157
2     159
3     149
4     158
5     148
6     150
7     157
8     119
9     154
10    159
11    154
12    156
13    150
14    153
15    157
16    131
17    159
18     17
19    145
Name: games, dtype: int64
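
The difference between single and double square brackets is easy to miss. Here is a small sketch, using a made-up `sample_df` instead of the CSV data, that checks the resulting types and confirms the series has one value per dataframe row:

```python
from pandas import DataFrame, Series

sample_df = DataFrame({"year": [1995, 1996, 1997], "games": [15, 157, 159]})

games = sample_df["games"]       # single brackets (one name): Series
games_df = sample_df[["games"]]  # double brackets (a list of names): DataFrame

print(isinstance(games, Series), isinstance(games_df, DataFrame))

# the series has one value per row, and shares the dataframe's row index
print(len(games) == len(sample_df))
print(games.index.equals(sample_df.index))
```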


We can get the values, the distinct values, and the [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) for a given series:

In [None]:

print("--------------------")
print("VALUES:")
print(stats_df["games"].values)

print("--------------------")
print("DISTINCT VALUES:")
print(stats_df["games"].unique())

print("--------------------")
print("VALUE COUNTS:")
print(stats_df["games"].value_counts())

print("--------------------")
print("VALUE COUNTS (NORMALIZED):")
print(stats_df["games"].value_counts(normalize=True))

# FYI: can convert this result to its own df:
# counts_df = stats_df["games"].value_counts().to_frame()

--------------------
VALUES:
[ 15 157 159 149 158 148 150 157 119 154 159 154 156 150 153 157 131 159
  17 145]
--------------------
DISTINCT VALUES:
[ 15 157 159 149 158 148 150 119 154 156 153 131  17 145]
--------------------
VALUE COUNTS:
159    3
157    3
154    2
150    2
158    1
156    1
153    1
119    1
149    1
148    1
17     1
15     1
145    1
131    1
Name: games, dtype: int64
--------------------
VALUE COUNTS (NORMALIZED):
159    0.15
157    0.15
154    0.10
150    0.10
158    0.05
156    0.05
153    0.05
119    0.05
149    0.05
148    0.05
17     0.05
15     0.05
145    0.05
131    0.05
Name: games, dtype: float64


We can aggregate values in a series:

In [None]:

print("----------------------------")
print("SERIES AGGREGATIONS...")

print("SUM:", stats_df["games"].sum()) 
print("COUNT DISTINCT:", stats_df["games"].nunique())

print("MIN:", stats_df["games"].min()) 
print("MAX:", stats_df["games"].max())

print("MEAN:", stats_df["games"].mean()) 
print("MEDIAN:", stats_df["games"].median()) 


----------------------------
SERIES AGGREGATIONS...
SUM: 2747
COUNT DISTINCT: 14
MIN: 15
MAX: 159
MEAN: 137.35
MEDIAN: 153.5


In [None]:
stats_df["games"].describe()

count     20.000000
mean     137.350000
std       42.671666
min       15.000000
25%      147.250000
50%      153.500000
75%      157.000000
max      159.000000
Name: games, dtype: float64
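
If we only need a few specific aggregations, one option is to pass a list of function names to the series `agg()` method. This is a sketch using a short made-up series (the first five `games` values from above), not the full dataset:

```python
from pandas import Series

# the first five "games" values from the dataset above
games = Series([15, 157, 159, 149, 158], name="games")

# passing a list of function names returns a new Series, labeled by function name
summary = games.agg(["min", "max", "mean", "median"])
print(summary)
print(summary["median"])
```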

We can copy columns, perform column-wise operations, and create new ad-hoc columns as desired:


In [None]:
print("-------------------")
print("COPYING COLUMNS...")
stats_df["year_copy"] = stats_df["year"]
print(stats_df.columns)

print("-------------------")
print("DROPPING COLUMNS...")
stats_df.drop(columns=["year_copy"], inplace=True)
print(stats_df.columns)

print("-------------------")
print("COLUMN-WISE OPERATIONS...")

stats_df["batting_avg"] = stats_df["hits"] / stats_df["at_bats"]
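# NOTE: this is a simplified on-base percentage (the official formula also accounts for hit-by-pitch and sacrifice flies)
stats_df["on_base_pct"] = (stats_df["hits"] + stats_df["walks"]) / stats_df["at_bats"]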
print(stats_df.columns)
print(stats_df.head())


-------------------
COPYING COLUMNS...
Index(['year', 'games', 'at_bats', 'runs', 'hits', 'walks', 'year_copy'], dtype='object')
-------------------
DROPPING COLUMNS...
Index(['year', 'games', 'at_bats', 'runs', 'hits', 'walks'], dtype='object')
-------------------
COLUMN-WISE OPERATIONS...
Index(['year', 'games', 'at_bats', 'runs', 'hits', 'walks', 'batting_avg',
       'on_base_pct'],
      dtype='object')
   year  games  at_bats  runs  hits  walks  batting_avg  on_base_pct
0  1995     15       48     5    12      3     0.250000     0.312500
1  1996    157      582   104   183     48     0.314433     0.396907
2  1997    159      654   116   190     74     0.290520     0.403670
3  1998    149      626   127   203     57     0.324281     0.415335
4  1999    158      627   134   219     91     0.349282     0.494418


We can transform the values in a series using [`pandas.Series.map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) or [`pandas.Series.apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html):


In [None]:

def fmt_pct(number):
    """ Formats a decimal number like 0.7 as a percentage: "70.0%" """
    return f"{(number * 100):.1f}%"

stats_df["batting_avg_label"] = stats_df["batting_avg"].map(fmt_pct)
stats_df["on_base_pct_label"] = stats_df["on_base_pct"].map(fmt_pct)

print(stats_df.head())

   year  games  at_bats  ...  on_base_pct  batting_avg_label  on_base_pct_label
0  1995     15       48  ...     0.312500              25.0%              31.2%
1  1996    157      582  ...     0.396907              31.4%              39.7%
2  1997    159      654  ...     0.403670              29.1%              40.4%
3  1998    149      626  ...     0.415335              32.4%              41.5%
4  1999    158      627  ...     0.494418              34.9%              49.4%

[5 rows x 10 columns]
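
For a simple element-wise function like `fmt_pct`, `map()` and `apply()` should give the same result. The sketch below checks this on a short made-up series, and also shows that `map()` accepts a dictionary of replacements (the `grades` and `points` names are illustrative only):

```python
from pandas import Series

def fmt_pct(number):
    """ Formats a decimal number like 0.7 as a percentage: "70.0%" """
    return f"{(number * 100):.1f}%"

batting_avg = Series([0.25, 0.5, 0.75])

# for an element-wise function, map() and apply() produce the same values
mapped = batting_avg.map(fmt_pct)
applied = batting_avg.apply(fmt_pct)
print(mapped.tolist())
print(mapped.equals(applied))

# map() also accepts a dictionary; values without a matching key become NaN
grades = Series(["A", "B", "C"])
points = grades.map({"A": 4.0, "B": 3.0})
print(points.tolist())
```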


### Filtering Rows

We can filter the rows to return only those matching a given condition:

In [None]:

print("FILTERING BASED ON NUMERIC OPERATIONS...")
print(stats_df[stats_df["games"] > 150])

print("FILTERING BASED ON STRING OPERATIONS...")
df = DataFrame([
    {"id":1, "grade": "A+"},
    {"id":2, "grade": "A"},
    {"id":3, "grade": "A-"},
    {"id":4, "grade": "B+"},
    {"id":5, "grade": "B"}
])
print(df[df['grade'].str.contains("A")])




FILTERING BASED ON NUMERIC OPERATIONS...
    year  games  at_bats  ...  on_base_pct  batting_avg_label  on_base_pct_label
1   1996    157      582  ...     0.396907              31.4%              39.7%
2   1997    159      654  ...     0.403670              29.1%              40.4%
4   1999    158      627  ...     0.494418              34.9%              49.4%
7   2002    157      644  ...     0.409938              29.7%              41.0%
9   2004    154      643  ...     0.363919              29.2%              36.4%
10  2005    159      654  ...     0.426606              30.9%              42.7%
11  2006    154      623  ...     0.454254              34.3%              45.4%
12  2007    156      639  ...     0.410016              32.2%              41.0%
14  2009    153      634  ...     0.447950              33.4%              44.8%
15  2010    157      663  ...     0.365008              27.0%              36.5%
17  2012    159      683  ...     0.382138              31.6%       
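
The expression inside the square brackets is itself a series of `True` / `False` values, one per row. As a sketch on a small made-up dataframe (`sample_df` stands in for `stats_df`), we can inspect that series and combine several conditions:

```python
from pandas import DataFrame

sample_df = DataFrame({
    "year": [1995, 1996, 1997, 1998],
    "games": [15, 157, 159, 149],
    "hits": [12, 183, 190, 203],
})

# the comparison produces a boolean Series (one True/False value per row)
mask = sample_df["games"] > 150
print(mask.tolist())

# combine conditions with & (and) or | (or), wrapping each condition in parentheses
both = sample_df[(sample_df["games"] > 150) & (sample_df["hits"] > 185)]
either = sample_df[(sample_df["games"] > 150) | (sample_df["hits"] > 200)]
print(both["year"].tolist())
print(either["year"].tolist())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators like `>` in Python.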

### Grouping and Aggregating Rows

A dataframe's `groupby` method returns a `GroupBy` object. Aggregating that object produces a summary table with one row per group, much like a pivot table in a spreadsheet.

Reference:

  + [`DataFrame.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
  + [`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)
  + [`GroupBy.aggregate`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html)
  + [Naming columns returned by groupby aggregate functions](https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function): the takeaway is to flatten the column names (using `ravel`) after aggregating, so they are easier to reference




In [None]:
teams_df = DataFrame([
    {"city": "New York", "name": "Yankees"},
    {"city": "New York", "name": "Mets"},
    {"city": "Boston", "name": "Red Sox"},
    {"city": "New Haven", "name": "Ravens"}
])

print(type(teams_df.groupby(["city"]))) #> DataFrameGroupBy a.k.a. GroupBy

# can use aggregate function on the GroupBy object
teams_pivot = teams_df.groupby(["city"]).agg({'name': ['nunique']})
print(type(teams_pivot))



# TIP: reset the columns to be a single level, to make sorting easier later
# yeah, we're using the ravel function. just treat it as boilerplate
# so we can reference the aggregated column names more easily later
print("-----------")
print("COLUMN NAMES AFTER GROUPBY:")
print(teams_pivot.columns)
print(teams_pivot[('name', 'nunique')]) # this is how we would need to reference the col name if we don't ravel to rename

teams_pivot.columns = ["_".join(col) for col in teams_pivot.columns.ravel()]
print("-----------")
print("COLUMN NAMES AFTER RAVEL:")
print(teams_pivot.columns)
print(teams_pivot["name_nunique"]) # this is how we reference the col name after we ravel to rename

# sorting the pivot table
teams_pivot.sort_values(by="name_nunique", ascending=False, inplace=True)
print(teams_pivot["name_nunique"])


<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<class 'pandas.core.frame.DataFrame'>
-----------
COLUMN NAMES AFTER GROUPBY:
MultiIndex([('name', 'nunique')],
           )
city
Boston       1
New Haven    1
New York     2
Name: (name, nunique), dtype: int64
-----------
COLUMN NAMES AFTER RAVEL:
Index(['name_nunique'], dtype='object')
city
Boston       1
New Haven    1
New York     2
Name: name_nunique, dtype: int64
city
New York     2
Boston       1
New Haven    1
Name: name_nunique, dtype: int64
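
An alternative to renaming the columns after the fact is "named aggregation", where each keyword argument to `agg()` names an output column. This is a sketch that assumes a reasonably recent version of pandas (named aggregation is not available in very old releases):

```python
from pandas import DataFrame

teams_df = DataFrame([
    {"city": "New York", "name": "Yankees"},
    {"city": "New York", "name": "Mets"},
    {"city": "Boston", "name": "Red Sox"},
    {"city": "New Haven", "name": "Ravens"},
])

# named aggregation: new_column_name=(source_column, aggregation_function)
teams_pivot = teams_df.groupby("city").agg(name_nunique=("name", "nunique"))
print(teams_pivot.columns.tolist())  # single-level column names, no renaming needed

# sorting the pivot table
teams_pivot = teams_pivot.sort_values(by="name_nunique", ascending=False)
print(teams_pivot["name_nunique"].to_dict())
```

With this approach the aggregated column gets a single-level name up front, so the renaming step is not needed.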


## Exporting Data Frames

### To List of Dictionaries (Records)

Convert a dataframe to a list of dictionaries:

```python
df.to_dict("records") # NOTE: "records" is a value for the orient parameter of to_dict(), which controls the output format; it is not a characteristic of the underlying data
```
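
As a quick illustration, here is a sketch using the small `id` / `grade` dataframe from earlier. It converts the dataframe to records, then passes those records back into `DataFrame()` to show the round trip:

```python
from pandas import DataFrame

df = DataFrame({"id": [1, 2, 3], "grade": ["A", "B", "C"]})

# each row becomes a dictionary, with the column names as keys
records = df.to_dict("records")
print(records)

# a list of records is also a valid input for creating a dataframe
round_trip = DataFrame(records)
print(round_trip.equals(df))
```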

### To List of Lists 

Convert a dataframe to a list of lists, each representing a row in the dataframe:

```python
df.values.tolist()
```
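
Note that the list of lists contains only the values, not the column names. A minimal sketch, again using the small `id` / `grade` dataframe (the `rows`, `header`, and `rebuilt` names are made up for illustration):

```python
from pandas import DataFrame

df = DataFrame({"id": [1, 2, 3], "grade": ["A", "B", "C"]})

# each inner list holds the values for one row (column names are not included)
rows = df.values.tolist()
print(rows)

# keep the column names separately if we need to rebuild the dataframe later
header = df.columns.tolist()
rebuilt = DataFrame(rows, columns=header)
print(rebuilt.equals(df))
```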

### To CSV File

Write a dataframe to a CSV file:

```python
df.to_csv("my_data_copy.csv")
```
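
By default, `to_csv()` also writes the row index as an unnamed first column. The sketch below compares the default with `index=False`. To avoid creating files, it relies on `to_csv()` returning the CSV text when no filepath is given:

```python
from io import StringIO
from pandas import DataFrame, read_csv

df = DataFrame({"id": [1, 2, 3], "grade": ["A", "B", "C"]})

# when no filepath is given, to_csv() returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header row starts with an empty index column
print(without_index.splitlines()[0])  # header row has only the column names

# reading the index-free text back yields the original columns
restored = read_csv(StringIO(without_index))
print(restored.columns.tolist())
```

Passing `index=False` is usually what we want when the index is just the default row numbers, because otherwise reading the file back produces an extra unnamed column.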