Skip to content
Jonathan Morgan edited this page Jul 9, 2021 · 9 revisions

pandas notes

Table of Contents

DataFrames and Series

Basics

Column names

  • Get list of column names

      my_list = df.columns.values.tolist()
    

Data frame size

  • use data_frame.count()

    • count on data frame returns a Series with a count for each column, where the index for each value is the column name. The count is of non-NA rows within the column, so you can get different counts per column, and it is probably a good idea to not use this as a proxy for number of rows in a DataFrame unless you know the character of a particular column and it is a good proxy. Example:

        # get count() Series.
        total_count = test_df.count()
      
        # get count for column "hooha"
        total_count = total_count[ "hooha" ]
      
        # and convert it to an int.
        total_count = int( total_count )
      
  • use len( data_frame ) - gives the number of rows in the data frame.

Descriptive statistics

DataFrame subsetting

  • retrieve rows X through Y of a data frame - use .iloc[] and slice notation (0-indexed, start is included, end is excluded):

    • for example, retrieve rows 1 through 5000 of temp_df into test_df:

        test_df = temp_df.iloc[ 0 : 5000 ]
      

DataFrame filtering

  • To find rows in a DataFrame where a certain column = a certain value:

      temp_df = csv_df[ csv_df[ "column_name" ] == "column_value" ]
    
  • To filter using a regular expression:

      temp_df = csv_df[ csv_df[ "column_name" ].str.contains( r"<regex>" ) == True ]
    
    • example - find all rows with a date string patterned on "MMM-YY" ("May-01"):

        temp_df = csv_df[ csv_df[ "ici_start_date" ].str.contains( r"^[a-z,A-Z]+-\d+$" ) == True ]
      
  • use "ampersand" to join conditions, not "and".

  • just get specific columns:

    • Pass the "[]" operator a list of column names.  So, to retrieve just columns "A", "B", and "C" from DataFrame example_df:

        temp_df = example_df[ [ "A", "B", "C", ] ]
      
  • Get a single column:

    • include column name in "[]" operator.  So, to retrieve the values for column "A" from DataFrame example_df:

        column_values = example_df[ "A" ]
      
  • More information:

Combine DataFrames

Add an empty column

To add an empty column, inside square brackets, reference a column name that doesn't already exist and set it to a value:

df[ "<column_name>" ] = ""
df[ "<column_name>" ] = np.nan
# ... etc.

Drop a column in a DataFrame

  • The best way to do this in pandas is to use drop:

      df = df.drop( 'column_name', 1 )
    
  • where 1 is the axis number (0 for rows and 1 for columns.)

  • To delete the column without having to reassign df you can do:

      df.drop( 'column_name', axis = 1, inplace = True )
    
  • Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:

      df.drop( df.columns[ [ 0, 1, 3 ] ], axis = 1 )  # df.columns is zero-based pd.Index
    
  • from: http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe#13485766

Rename a column in a DataFrame

Selecting Rows that contain NaN or None (in Object dtype)

# convert any None to "-1".
compare_values_1.fillna( "-1" )
compare_values_1[ compare_values_1.isnull() ] = "-1"

Random sample from DataFrame

Basic random sample from DataFrame can use ".sample( n = <sample_size> )":

current_bin_sample_df = current_bin_df.sample( n = sample_this_many )

Can also sample a percentage of the total size of the data frame using "frac" parameter (50% sample):

current_bin_sample_df = current_bin_df.sample( frac = 0.5 )

More info:

Using iloc

  • iloc[] lets you filter a DataFrame or Series on values in the structure's index.  You can ask for a particular row by passing an index value.

  • use square brackets, not parentheses, to hold arguments to iloc (example of passing an index).

    • Yes:

        user_id = reliability_names_df_IN[ column_name ].iloc[ 0 ]
      
    • No:

        user_id = reliability_names_df_IN[ column_name ].iloc( 0 )
      

Using loc

  • loc[] lets you filter rows in a DataFrame or Series on values within the structure.  It accepts two arguments:

    • specification of which row(s) you want to target.  This could be a list of indexes, a Python slice statement, or it could be a conditional statement like you'd pass to an if statement.
    • specification of which column(s) within matching rows you want to target.  This could be a list of column names, or it could be a conditional statement like you'd pass to an if statement (though I'm honestly not sure how a filter here would work).
  • Use ".notnull()" to find just rows where a given column is not null:

      df.loc[ df[ "column_name" ].notnull() ]
    
  • Just columns 1, 2, and 3 where a given column is not null:

      df.loc[ df[ "column_name" ].notnull(), [ "column1", "column2", "column3" ] ]
    
  • Use ".isnull()" to find rows where a column's value is null:

      df.loc[ df[ "column_name" ].isnull() ]
    
  • example column "code_value", using results of .loc[] to update value of that column:

      df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
    
    • converts column code_value to int

        df[ "code_value" ].astype( int )
      
    • filters rows to only those where the value is <= 0:

        df[ "code_value" ].astype( int ) <= 0
      
    • after a comma, specify the columns you want to interact with in rows that fit your filter criteria (just "code_value", same column we are filtering on):

        df[ "code_value" ].astype( int ) <= 0, [ "code_value" ]
      
    • pass this all to .loc[] so it will do all the filtering of rows and targeting of columns:

        df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ]
      
    • then, assign "new_value" to the result of the .loc[] (the column "code_value" in each matching row):

        df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
      
  • another example of assigning a value to the filtering that loc lets you do:

      cleaned_data_frame.loc[ cleaned_data_frame[ "org_dept" ] == "ANESTHESIOLOGY", [ 'org_dept' ] ] = "MEDICINE"
    
  • add a column:

      df.loc[ :, "column_name" ] = column_values
    
  • pandas doc: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label

Replace

# replace all rows with "nan" with ""
df[ column_name ].replace( to_replace = "nan", value = "", inplace = True )

Replace with regex

# For all values that start with "b'" and end with "'", strip
#     that stuff off.
df[ column_name ].replace( to_replace = r"^b'(.*)'$", value = r"\1", regex = True, inline = True )

Replace using loc

  • assigning a value to the filtering that loc lets you do:

      cleaned_data_frame.loc[ cleaned_data_frame[ \"org_dept\" ] == \"ANESTHESIOLOGY\", [ \'org_dept\' ] ] = \"MEDICINE\"
    
  • See "Using loc" above for more details. 

Import/export

DataFrame.to_csv()

df.to_csv( file_path, index = False, encoding = \"utf-8\")

Options for outputting CSV to table

Use psycopg2 copy_from:

import io

sqlalchemy_conn = sqlalchemy_engine.raw_connection()
sqlalchemy_cursor = sqlalchemy_conn.cursor()

# make string buffer
df_buf = io.StringIO()

# store data frame there as CSV
resdf.to_csv( df_buf )

# reset buffer to the beginning
df_buf.seek( 0 )

# write to table.
sqlalchemy_cursor.copy_from( df_buf, table_name, sep = "," )
sqlalchemy_conn.commit()  

Create INSERTS:

# df is the dataframe
if len(resdf) > 0:

    df_columns = list( resdf )

    # create (col1,col2,\...)
    columns = ",".join(df_columns)

    # create VALUES( '%s', '%s', ... ) one '%s' per column
    values = "VALUES( {} )".format( ",".join( [ "%s" for _ in df_columns ] ) )

    # create INSERT INTO table ( columns ) VALUES( '%s', ... )
    insert_stmt = "INSERT INTO {} ({}) {}".format( table_name, columns, values )

    print( "INSERT statement: " + insert_stmt )

    sqlalchemy_conn = sqlalchemy_engine.raw_connection()
    sqlalchemy_cursor = sqlalchemy_conn.cursor()
    psycopg2.extras.execute_batch( sqlalchemy_cursor, insert_stmt, resdf.values )
    sqlalchemy_conn.commit()
    sqlalchemy_cursor.close()

#-- END check if anything in data frame --#

Parsing strings

  • How to do left(), right(), and mid() in pandas:

  • to split a date string (MMDDYYYY) out into separate date part columns:

    • If column is an integer, convert it to string:

        temp_series = df[ "<column_name>" ].apply( str )
        # OR slower...
        temp_series = df[ "<column_name>" ].astype( str )
      
    • zero-pad to 8 digits:

        temp_series = temp_series.str.zfill( 8 )
      
    • parse out date parts:

      • month - left()-most 2 characters: df[ 'column_name' ].str[ : 2 ]
      • day - mid() 2 and 3: df[ 'column_name' ].str[ 2 : 4 ]
      • year - right()-most 4 characters: df[ 'column_name' ].str[ -4 : ]

Performance

dask

pandas and large data sets

function cleanData()

def cleanData( data_frame_IN, column_name_IN, filter_value_IN ):

    """
    Parameters
    ----------
    - data_frame_IN : A pandas DataFrame
    - column_name_IN : Name of the column on the dataframe
    - filter_value_IN : The value that causes rows to be filtered out of data_frame_IN
           if it is present in the column named column_name_IN.

    Returns
    -------
    - cleaned_data_OUT : A Pandas DataFrame containing only rows that did not have the
                specified value in the specified column
    """

    # return reference
    cleaned_data_OUT = None

    # check to make sure that column name is in data frame's list of column names.
    if( column_name_IN not in list( data_frame_IN.columns.values ) ):

        print( "ERROR : Column you specified not present in the dataframe" )
        clean_data_OUT = None

    #-- END check to see if column is in data frame's list of column names. --#

    if filter_value_IN.upper() == "NULL":

        # keep rows where column passed is not NULL.
        cleaned_data_OUT = data_frame_IN[ pandas.notnull( data_frame_IN[ column_name_IN ] ) == True ]

    else:

        # keep rows where column passed in does not contain filter_out_value_IN.
        cleaned_data_OUT = data_frame_IN[ data_frame_IN[ column_name_IN ] != filter_value_IN ]

    #-- END check for "NULL" --#

    return cleaned_data_OUT

#-- END function cleanData() --#

print( "function cleanData() defined at \" + str( datetime.datetime.now() ) )

Dummy-coding data

With scikit-learn

Cheat Sheet

Examples

# CONSTANTS-ish
INFO_COUNT = "count"
INFO_MIN = "min"
INFO_MAX = "max"
INFO_MEAN = "mean"
INFO_MEDIAN = "median"
INFO_STD = "std"
INFO_DESCRIBE = "describe"

def get_variable_info( df_IN, var_name_IN, debug_flag_IN = False ):
    
    # return reference
    info_OUT = None

    # declare variables
    me = "get_variable_info"
    debug_flag = None
    df_info_dict = None
    my_series = None
    my_count = None
    my_min = None
    my_max = None
    my_mean = None
    my_median = None
    my_std = None
    my_describe = None
    
    # init
    debug_flag = debug_flag_IN

    # get column as Series
    my_series = df_IN[ var_name_IN ]

    # retrieve information
    my_count = my_series.count()
    my_min = my_series.min()
    my_max = my_series.max()
    my_mean = my_series.mean()
    my_median = my_series.median()
    my_std = my_series.std()
    my_describe = my_series.describe()

    # create dictionary
    df_info_dict = {}
    df_info_dict[ INFO_COUNT ] = my_count
    df_info_dict[ INFO_MIN ] = my_min
    df_info_dict[ INFO_MAX ] = my_max
    df_info_dict[ INFO_MEAN ] = my_mean
    df_info_dict[ INFO_MEDIAN ] = my_median
    df_info_dict[ INFO_STD ] = my_std
    df_info_dict[ INFO_DESCRIBE ] = my_describe

    if ( debug_flag == True ):
    
        # render output
        status_string = "- count: {my_count}\n- min: {my_min}\n- max: {my_max}\n- mean: {my_mean}\n- median: {my_median}\n- std: {my_std}".format(
            my_count = df_info_dict.get( INFO_COUNT ),
            my_min = df_info_dict.get( INFO_MIN ),
            my_max = df_info_dict.get( INFO_MAX ),
            my_mean = df_info_dict.get( INFO_MEAN ),
            my_median = df_info_dict.get( INFO_MEDIAN ),
            my_std = df_info_dict.get( INFO_STD )
        )

        print( status_string )
        
    #-- END debug --#
    
    info_OUT = df_info_dict
    return info_OUT
    
#-- END function get_variable_info() --#

print( "function get_variable_info() defined at {}".format( datetime.datetime.now() ) )

# CONSTANTS-ish
ROUND_UP = "up"
ROUND_DOWN = "down"

# loop, grabbing subsequent data frame subsets of size bin_size
#     each time through, until last bin, then grab from offset
#     to the end.

def process_bins( df_IN, bin_count_IN, sample_size_IN, do_round_IN = ROUND_UP, debug_flag_IN = False ):

    # return reference
    sample_df_OUT = None
    
    # declare variables
    me = "process_bins"
    debug_flag = None
    status_message = None
    
    # declare variables - bin size
    df_row_count = None
    bin_count = None
    bin_size_exact = None
    bin_size = None
    do_round = None
    
    # declare variables - sample per bin
    sample_size = None
    sample_per_bin_exact = None
    sample_per_bin = None
    
    # declare variables - process bins
    bin_index = None
    bin_number = None
    df_offset = None
    df_start_index = None
    df_end_index = None
    current_bin_df = None
    current_bin_info = None
    sample_count = None
    sample_this_many = None
    current_bin_sample_df = None

    # init
    debug_flag = debug_flag_IN
    df_row_count = len( df_IN )
    bin_count = bin_count_IN
    sample_size = sample_size_IN
    do_round = do_round_IN
    sample_count = 0

    #==========================================================================#
    # bin size
    #==========================================================================#

    # divide the row count by bin_count to get bin size
    bin_size_exact = df_row_count / bin_count

    # round up or round down?
    if ( do_round == ROUND_UP ):

        # use math.ceil() to round up.
        bin_size = math.ceil( bin_size_exact )

    else:

        # either ROUND_DOWN or none set.
        bin_size = math.floor( bin_size_exact )

    #-- END check to see how we round. --#

    # finally, convert to integer.
    bin_size = int( bin_size )

    if ( debug_flag == True ):
        status_message = "break {row_count} rows into {bin_count} bins: bin size = {bin_size} ( exact: {exact_size} )".format(
            row_count = df_row_count,
            bin_count = bin_count,
            bin_size = bin_size,
            exact_size = bin_size_exact
        )
        print( status_message )
    #-- END DEBUG --#
    
    #==========================================================================#
    # sample per bin
    #==========================================================================#
    
    # divide sample size by bin count to get number we select per bin
    sample_per_bin_exact = sample_size / bin_count
    
    # for decimals, round down, so we oversample from last bin (outliers).
    sample_per_bin = math.floor( sample_per_bin_exact )

    # finally, convert to integer.
    sample_per_bin = int( sample_per_bin )

    if ( debug_flag == True ):
        status_message = "sample {per_bin} rows from each of {bin_count} bins ( exact: {exact_size} )".format(
            per_bin = sample_per_bin,
            bin_count = bin_count,
            exact_size = sample_per_bin_exact
        )
        print( status_message )
    #-- END DEBUG --#

    #==========================================================================#
    # process bins
    #==========================================================================#

    # loop over bins
    sample_count = 0
    for bin_index in range( 0, bin_count ):

        # init - calculate offset
        df_offset = bin_index * bin_size
        df_start_index = df_offset
        df_end_index = df_offset + bin_size

        # are we at the end?
        bin_number = bin_index + 1

        if ( debug_flag == True ):
            status_message = "\n\n----> BIN #{bin_number} - offset: {my_offset}; start: {start_index}; end: {end_index}; bin index: {my_index}".format(
                bin_number = bin_number,
                my_offset = df_offset,
                start_index = df_start_index,
                end_index = df_end_index,
                my_index = bin_index
            )
            print( status_message )
        #-- END DEBUG --#

        if ( bin_number == bin_count ):

            # last bin - get from offset to the end.
            current_bin_df = df_IN[ df_start_index : ]
            
            # sample as many as needed to get to total.
            sample_this_many = sample_size - sample_count

        else:

            # not last bin - get next bin_size past offset.
            current_bin_df = df_IN[ df_start_index : df_end_index ]
            
            # sample as many as needed from current bin...
            sample_this_many = sample_per_bin
            
        #-- END check if last bin or not --#

        # sample
        current_bin_sample_df = current_bin_df.sample( n = sample_this_many )
        print( "\n\nSample from bin #{}".format( bin_number ) )
        print( current_bin_sample_df )
        
        # first bin?
        if ( sample_df_OUT is None ):
            
            # first bin - just store the sample DataFrame
            sample_df_OUT = current_bin_sample_df
            
        else:
            
            # append sample DataFrame to output DataFrame
            sample_df_OUT = sample_df_OUT.append( current_bin_sample_df, ignore_index = True )
            
        #-- END check if first bin or not. --#
        
        # update total.
        sample_count += sample_this_many
        
    #-- END loop over bins --#
    
    return sample_df_OUT
    
#-- END function process_bins() --#

print( "function process_bins() defined at {}".format( datetime.datetime.now() ) )

Clone this wiki locally