### A DataFrame is an Excel worksheet.  Sort of.

You can think of a pandas DataFrame as a worksheet "tab".    It's a 2d grid of cells, each of which can hold a value.  The rows and columns can be addressed and new ones added.  Values can be combined as needed.  We can search, do lookups, sort, change output format, and perform most other spreadsheet operations.

What do we lose?   The GUI (graphical user interface).
What do we gain?   Power.

This section will walk through some of the basic "spreadsheet" operations you can do with pandas.

In [51]:
import pandas as pd
import sys
import os
from string import ascii_lowercase

#This adds ../custom_utils/ to the Python search path 
sys.path.insert(1,os.path.join(os.path.split(os.getcwd())[0], 'custom_utils'))
import utilities

In [53]:
#Let's make up some data...

data = \
    [
      ['sport',   'duration', 'fans'],
      ['baseball', 180, 1100],
      ['wrestling', 30,  300],
      ['gymnastics', 1,  120],      
    ]

#...and provide column headings just like Excel's:

cols = ['A', 'B', 'C' ]

#Here, we use a "constructor" to build a DataFrame
sports_df = pd.DataFrame(data=data, columns=cols)


#This is enough to print it:
print(sports_df)      #straight-up text
sports_df             #fancy HTML (works for DataFrames)   


            A         B     C
0       sport  duration  fans
1    baseball       180  1100
2   wrestling        30   300
3  gymnastics         1   120


,A,B,C
0,sport,duration,fans
1,baseball,180,1100
2,wrestling,30,300
3,gymnastics,1,120


We can access information in the DataFrame much like we would in Excel.   In pandas you can use the loc[ate] indexer built into the main data objects.  A couple of things to note:

- You can locate ranges using the index (it's like the "R1C1" notation);

- In pandas you use a verbose form of reference to include the source, as you 
  might in Excel using:  "=Sheet1!A1";
  
- When you use the <b>loc[ate] indexer</b>, you create slices of the rows and columns you want,
  separating the two with a comma like this:

<b>DataFrame.loc[my_row_slice , my_column_slice]</b>             

... note the square brackets!
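As a rough sketch of the range lookup mentioned above, the snippet below rebuilds the same made-up data (repeated only so the example runs on its own) and pulls out a small block of cells. With loc, a label slice includes both endpoints, unlike ordinary Python slicing. The names `demo_df` and `block` are just for illustration.

```python
import pandas as pd

# Same made-up data as above, with Excel-style column letters
demo_df = pd.DataFrame(
    data=[
        ['sport', 'duration', 'fans'],
        ['baseball', 180, 1100],
        ['wrestling', 30, 300],
        ['gymnastics', 1, 120],
    ],
    columns=['A', 'B', 'C'],
)

# Rows 1 through 2 and columns A through B.
# Both ends of each slice are included in the result.
block = demo_df.loc[1:2, 'A':'B']
print(block)
```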

In [54]:
#All rows in some column
fans =  sports_df.loc[:, "C"]               #Col C         "C:C" in Excel

#All columns in some row
baseball = sports_df.loc[1, :]              #Row 1         "1:1" in Excel

#Some specific cell
wrestling_duration = sports_df.loc[2, "B"]  #Row 2, Col B  "B2" in Excel

#A single row or column is a pandas Series object.  A single value is just a plain scalar.
#   DataFrames print more nicely, so we'll convert.  You can safely ignore this for now.
fans_df = fans.to_frame();  baseball_df=baseball.to_frame(); 
wrestling_duration_df=pd.DataFrame([wrestling_duration], columns = ['Single Cell'], index = [""])
display_w('fans_df', 'baseball_df', 'wrestling_duration_df')


,C
0,fans
1,1100
2,300
3,120

,1
A,baseball
B,180
C,1100

,Single Cell
,30


This all looks familiar, right?  It's a simple spreadsheet with a bit of data.

But we can do better.  Sticking with Excel's notion of row and column names can be cumbersome.    If you think about it, what's intuitive about the column name "C"?   And what's meaningful about the row name "2"?

When you use Python, you aren't stuck with the "R1C1" indexing.    In fact, you can "bolt on" any sort of index you can dream up.   More on indexing later, but here's an easy way to build the DataFrame to incorporate sensible row and column headers:

In [55]:
#Figure out what rows of data are really 'data' and which are headings
df_data = data[1:4]  #skip the first row (which has the labels)
df_cols = data[0]    #just the first row

#You can explicitly specify the data and the column headings
df_fixed_cols = pd.DataFrame(df_data, columns=df_cols)

#You can pick one of the columns to serve as the row index like this:
df_fixed_all = df_fixed_cols.set_index('sport')


display_w('sports_df', 'df_fixed_cols', 'df_fixed_all')

,A,B,C
0,sport,duration,fans
1,baseball,180,1100
2,wrestling,30,300
3,gymnastics,1,120

,sport,duration,fans
0,baseball,180,1100
1,wrestling,30,300
2,gymnastics,1,120

sport,duration,fans
baseball,180,1100
wrestling,30,300
gymnastics,1,120


Looking better, right?   

Now, with the row and column indices fixed, we have a much easier time working with the data we want.  Instead of referring to index values like "A" and "3", we can now refer to the data in terms of "baseball" or "fans".   We'll call our fixed DataFrame 'sports_df', then ...

... if  we just want to look at fans we can go:

<b>sports_df['fans']</b>

Here are a couple of examples:

In [56]:
sports_df = df_fixed_all

#Just the fans
fans = sports_df['fans']

#Just baseball
baseball = sports_df.loc['baseball', :]

#Again, we're converting to DataFrames for display aesthetics; you can safely ignore this.
fans_df = fans.to_frame();  baseball_df=baseball.to_frame(); 
display_w('df_fixed_all', 'fans_df', 'baseball_df', )

sport,duration,fans
baseball,180,1100
wrestling,30,300
gymnastics,1,120

sport,fans
baseball,1100
wrestling,300
gymnastics,120

,baseball
duration,180
fans,1100


##  "Garden Variety" Spreadsheet Operations

This section will walk you through the Python equivalent of normal spreadsheet tasks.  You can use this as a cheat sheet later.

### Add columns

One big difference between Excel and pandas is that you don't have to fool around with every single cell in pandas.  Unless you tell it differently, pandas <u>assumes</u> that you're working on intact columns of data.   In fact, if you want to use a really shorthand form of address, you can just specify the column name and pandas will know what to do.  These two are equivalent:

sports_df.loc[:, 'fans']     and      sports_df['fans']
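Here is a quick way to check that claim for yourself. This is only a sketch: `demo_df` is a stand-in rebuilt from the same made-up data so the snippet runs on its own, and the other variable names are illustrative.

```python
import pandas as pd

# Stand-in for the sports data, indexed by sport
demo_df = pd.DataFrame(
    {'duration': [180, 30, 1], 'fans': [1100, 300, 120]},
    index=pd.Index(['baseball', 'wrestling', 'gymnastics'], name='sport'),
)

shorthand = demo_df['fans']         # column name only
explicit = demo_df.loc[:, 'fans']   # all rows, one column

# Series.equals compares both the values and the index labels
same = shorthand.equals(explicit)
print(same)
```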

If you want to add a new column, you can just go ahead and specify it like this:

sports_df['cheer'] = 'Go Cubs!'

Math operations are specified on a column-wise basis by default.  If you want to double the fan base for these sports you could go:

sports_df['2x fans'] = sports_df['fans'] * 2
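Columns can be combined with each other in the same way, row by row. A minimal sketch, assuming the same made-up data indexed by sport. The `demo_df` name and the 'fans per minute' column are invented for illustration and are not part of the original notebook.

```python
import pandas as pd

# Stand-in for the sports data, indexed by sport
demo_df = pd.DataFrame(
    {'duration': [180, 30, 1], 'fans': [1100, 300, 120]},
    index=pd.Index(['baseball', 'wrestling', 'gymnastics'], name='sport'),
)

# Division is applied element by element, matching rows by their index label
demo_df['fans per minute'] = demo_df['fans'] / demo_df['duration']
print(demo_df)
```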

In [57]:
#Add columns
original = sports_df.copy()
sports_df['cheer'] = 'Go Cubs!'
sports_df['2x fans'] = sports_df['fans'] * 2
display_w('original','sports_df')

sport,duration,fans
baseball,180,1100
wrestling,30,300
gymnastics,1,120

sport,duration,fans,cheer,2x fans
baseball,180,1100,Go Cubs!,2200
wrestling,30,300,Go Cubs!,600
gymnastics,1,120,Go Cubs!,240


You can update individual cells with <b>loc[ate]</b> as before.   For instance, you may think "Go Cubs!" would sound strange unless the sport is baseball.  So you might go:

In [58]:
original = sports_df.copy()
sports_df.loc['wrestling', 'cheer'] = "Flip 'em"
sports_df.loc['gymnastics', 'cheer'] = "Tumble 'em"
display_w('original','sports_df')

sport,duration,fans,cheer,2x fans
baseball,180,1100,Go Cubs!,2200
wrestling,30,300,Go Cubs!,600
gymnastics,1,120,Go Cubs!,240

sport,duration,fans,cheer,2x fans
baseball,180,1100,Go Cubs!,2200
wrestling,30,300,Flip 'em,600
gymnastics,1,120,Tumble 'em,240


In [28]:
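def assess_crowd(fan_count):
    """Label a crowd as large, medium, or small based on the number of fans."""
    if fan_count > 1000:
        return 'large'
    elif fan_count > 200:
        return 'medium'
    else:
        return 'small'

sports_df = utilities.create_sports_data()
sports_df_original = sports_df.copy()   #untouched copy for the side-by-side display
#apply() calls assess_crowd once for each value in the 'fans' column
sports_df['crowd_size'] = sports_df['fans'].apply(assess_crowd)
display_w('sports_df_original', 'sports_df')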
    


KeyError: 'fans'

In [2]:
dir(utilities)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'create_sports_data',
 'pd']

In [5]:
df = utilities.create_sports_data()
df

,sport,duration,fans
0,baseball,180,1100
1,wrestling,30,300
2,gymnastics,1,120


In [None]:

outfn_base = 'sports'
excel_ext = '.xlsx'
csv_ext = '.csv'

def create_data():
    """Creates and returns a DataFrame"""
    data = \
        [
          ['baseball', 180, 1100],
          ['wrestling', 30,  300],
          ['gymnastics', 1,  120],      
        ]
    cols = ['sport', 'duration', 'fans' ]
    
    sports_df = pd.DataFrame(data=data, columns=cols)
    
    return sports_df

df = create_data()  

#Save as an Excel workbook and equivalent CSV file
df.to_excel(outfn_base + excel_ext)
df.to_csv(outfn_base + csv_ext)
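One detail worth knowing about the save step: by default, to_csv also writes the row index (0, 1, 2) as an extra first column with no heading. The sketch below shows the difference that `index=False` makes. It builds the CSV text in memory rather than writing files, and `demo_df`, `with_index`, `without_index` and `round_trip` are illustrative names only.

```python
import io
import pandas as pd

demo_df = pd.DataFrame(
    [['baseball', 180, 1100], ['wrestling', 30, 300], ['gymnastics', 1, 120]],
    columns=['sport', 'duration', 'fans'],
)

# With no path given, to_csv returns the CSV text as a string.
with_index = demo_df.to_csv()                # index written as an unnamed first column
without_index = demo_df.to_csv(index=False)  # data columns only

# Read the text back in; StringIO stands in for a file on disk
round_trip = pd.read_csv(io.StringIO(without_index))
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```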

In [None]:
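from IPython.display import display

#NOTE: color_negative_red and highlight_max are not defined anywhere in this notebook;
#      they must be defined before this cell will run.
with pd.option_context('display.precision', 2):
    html = (df.style
              .applymap(color_negative_red)
              .apply(highlight_max))
html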

In [None]:
df