pandas notes

pandas notes

DataFrames and Series

indexing: http://pandas.pydata.org/pandas-docs/stable/indexing.htm
good examples: https://github.com/CSSIP-AIR/Big-Data-Workbooks/blob/master/07.%20Machine%20Learning/Machine%20Learning.ipynb

find unique values in a column:

  #List unique values in the df[ 'name' ] column
  df.name.unique()

Basics

Column names

Get list of column names
```
  my_list = df.columns.values.tolist()
```

Data frame size

use data_frame.count()
- count on data frame returns a Series with a count for each column, where the index for each value is the column name. The count is of non-NA rows within the column, so you can get different counts per column, and it is probably a good idea to not use this as a proxy for number of rows in a DataFrame unless you know the character of a particular column and it is a good proxy. Example:
```
  # get count() Series.
  total_count = test_df.count()

  # get count for column "hooha"
  total_count = total_count[ "hooha" ]

  # and convert it to an int.
  total_count = int( total_count )
```
  - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html
use len( data_frame ) - gives the number of rows in the data frame.

Descriptive statistics

how to pull individual items out of "describe()" output: https://stackoverflow.com/questions/54393571/pandas-df-describe-how-do-i-extract-values-into-dataframe
describe() function doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

DataFrame subsetting

retrieve rows X through Y of a data frame - use .iloc[] and slice notation (0-indexed, start is included, end is excluded):
- for example, retrieve rows 1 through 5000 of temp_df into test_df:
```
  test_df = temp_df.iloc[ 0 : 5000 ]
```

DataFrame filtering

To find rows in a DataFrame where a certain column = a certain value:
```
  temp_df = csv_df[ csv_df[ "column_name" ] == "column_value" ]
```

To filter using a regular expression:

  temp_df = csv_df[ csv_df[ "column_name" ].str.contains( r"<regex>" ) == True ]

example - find all rows with a date string patterned on "MMM-YY" ("May-01"):

  temp_df = csv_df[ csv_df[ "ici_start_date" ].str.contains( r"^[a-z,A-Z]+-\d+$" ) == True ]

use "ampersand" to join conditions, not "and".

This works:

  person_df_lookup = person_df_IN[ ( person_df_IN[ column_name_person_id_1 ] > 0 ) & ( person_df_IN[ column_name_person_id_2 ] > 0 ) ]

this does not:

  person_df_lookup = person_df_IN[ ( person_df_IN[ column_name_person_id_1 ] > 0 ) and ( person_df_IN[ column_name_person_id_2 ] > 0 ) ]

http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining#11872393

just get specific columns:
- Pass the "[]" operator a list of column names. So, to retrieve just columns "A", "B", and "C" from DataFrame example_df:
```
  temp_df = example_df[ [ "A", "B", "C", ] ]
```
Get a single column:
- include column name in "[]" operator. So, to retrieve the values for column "A" from DataFrame example_df:
```
  column_values = example_df[ "A" ]
```
More information:
- http://pandas.pydata.org/pandas-docs/stable/indexing.html
- http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Combine DataFrames

use append() to tack rows onto the end of an existing DataFrame:
```
  df.append(df2, ignore_index=True)
```
- Keeps all columns present in either data frame.
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
OR concat():
```
  df_merged = pandas.concat( [ df1, df2 ], ignore_index = True )
```
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
once you are done combining, you can reset indices to the order of rows in the DataFrame using reset_index( inplace = True ). By default, this will move the existing index into a column named "level_0", then create new index from 0 to row count - 1, with no holes or inconsistencies. Example:
```
  volume_df.reset_index( inplace = True )
```
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
more ways to merge and combine: http://pandas.pydata.org/pandas-docs/stable/merging.html

Add an empty column

To add an empty column, inside square brackets, reference a column name that doesn't already exist and set it to a value:

df[ "<column_name>" ] = ""
df[ "<column_name>" ] = np.nan
# ... etc.

Drop a column in a DataFrame

The best way to do this in pandas is to use drop:
```
  df = df.drop( 'column_name', 1 )
```
where 1 is the axis number (0 for rows and 1 for columns.)
To delete the column without having to reassign df you can do:
```
  df.drop( 'column_name', axis = 1, inplace = True )
```
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
```
  df.drop( df.columns[ [ 0, 1, 3 ] ], axis = 1 )  # df.columns is zero-based pd.Index
```
from: http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe#13485766

Rename a column in a DataFrame

Use the rename function and refer the columns to be renamed. Not all the columns have to be renamed:

  df = df.rename( columns = { 'oldName1': 'newName1', 'oldName2': 'newName2' } )
  # OR
  df.rename( columns = { 'oldName1' : 'newName1', 'oldName2': 'newName2' }, inplace = True )

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
from http://stackoverflow.com/questions/11346283/renaming-columns-in-pandas#11354850

Selecting Rows that contain NaN or None (in Object dtype)

# convert any None to "-1".
compare_values_1.fillna( "-1" )
compare_values_1[ compare_values_1.isnull() ] = "-1"

Random sample from DataFrame

Basic random sample from DataFrame can use ".sample( n = <sample_size> )":

current_bin_sample_df = current_bin_df.sample( n = sample_this_many )

Can also sample a percentage of the total size of the data frame using "frac" parameter (50% sample):

current_bin_sample_df = current_bin_df.sample( frac = 0.5 )

More info:

Using iloc

iloc[] lets you filter a DataFrame or Series on values in the structure's index. You can ask for a particular row by passing an index value.

use square brackets, not parentheses, to hold arguments to iloc (example of passing an index).

Yes:

  user_id = reliability_names_df_IN[ column_name ].iloc[ 0 ]

No:

  user_id = reliability_names_df_IN[ column_name ].iloc( 0 )

Using loc

loc[] lets you filter rows in a DataFrame or Series on values within the structure. It accepts two arguments:
- specification of which row(s) you want to target. This could be a list of indexes, a Python slice statement, or it could be a conditional statement like you'd pass to an if statement.
- specification of which column(s) within matching rows you want to target. This could be a list of column names, or it could be a conditional statement like you'd pass to an if statement (though I'm honestly not sure how a filter here would work).
Use ".notnull()" to find just rows where a given column is not null:
```
  df.loc[ df[ "column_name" ].notnull() ]
```

Just columns 1, 2, and 3 where a given column is not null:

  df.loc[ df[ "column_name" ].notnull(), [ "column1", "column2", "column3" ] ]

Use ".isnull()" to find rows where a column's value is null:
```
  df.loc[ df[ "column_name" ].isnull() ]
```
example column "code_value", using results of .loc[] to update value of that column:
```
  df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
```
- converts column code_value to int
```
  df[ "code_value" ].astype( int )
```
- filters rows to only those where the value is <= 0:
```
  df[ "code_value" ].astype( int ) <= 0
```
- after a comma, specify the columns you want to interact with in rows that fit your filter criteria (just "code_value", same column we are filtering on):
```
  df[ "code_value" ].astype( int ) <= 0, [ "code_value" ]
```
- pass this all to .loc[] so it will do all the filtering of rows and targeting of columns:
```
  df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ]
```
- then, assign "new_value" to the result of the .loc[] (the column "code_value" in each matching row):
```
  df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
```

another example of assigning a value to the filtering that loc lets you do:

  cleaned_data_frame.loc[ cleaned_data_frame[ "org_dept" ] == "ANESTHESIOLOGY", [ 'org_dept' ] ] = "MEDICINE"

add a column:

  df.loc[ :, "column_name" ] = column_values

pandas doc: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label

Replace

# replace all rows with "nan" with ""
df[ column_name ].replace( to_replace = "nan", value = "", inplace = True )

Replace with regex

# For all values that start with "b'" and end with "'", strip
#     that stuff off.
df[ column_name ].replace( to_replace = r"^b'(.*)'$", value = r"\1", regex = True, inline = True )

Replace using loc

assigning a value to the filtering that loc lets you do:

  cleaned_data_frame.loc[ cleaned_data_frame[ \"org_dept\" ] == \"ANESTHESIOLOGY\", [ \'org_dept\' ] ] = \"MEDICINE\"

See "Using loc" above for more details.

Import/export

DataFrame.to_csv()

df.to_csv( file_path, index = False, encoding = \"utf-8\")

from: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

Options for outputting CSV to table

Use psycopg2 copy_from:

import io

sqlalchemy_conn = sqlalchemy_engine.raw_connection()
sqlalchemy_cursor = sqlalchemy_conn.cursor()

# make string buffer
df_buf = io.StringIO()

# store data frame there as CSV
resdf.to_csv( df_buf )

# reset buffer to the beginning
df_buf.seek( 0 )

# write to table.
sqlalchemy_cursor.copy_from( df_buf, table_name, sep = "," )
sqlalchemy_conn.commit()

Create INSERTS:

# df is the dataframe
if len(resdf) > 0:

    df_columns = list( resdf )

    # create (col1,col2,\...)
    columns = ",".join(df_columns)

    # create VALUES( '%s', '%s', ... ) one '%s' per column
    values = "VALUES( {} )".format( ",".join( [ "%s" for _ in df_columns ] ) )

    # create INSERT INTO table ( columns ) VALUES( '%s', ... )
    insert_stmt = "INSERT INTO {} ({}) {}".format( table_name, columns, values )

    print( "INSERT statement: " + insert_stmt )

    sqlalchemy_conn = sqlalchemy_engine.raw_connection()
    sqlalchemy_cursor = sqlalchemy_conn.cursor()
    psycopg2.extras.execute_batch( sqlalchemy_cursor, insert_stmt, resdf.values )
    sqlalchemy_conn.commit()
    sqlalchemy_cursor.close()

#-- END check if anything in data frame --#

Parsing strings

How to do left(), right(), and mid() in pandas:
- left()-most 2 characters: df[ 'column_name' ].str[ : 2 ]
- right()-most 4 characters: df[ 'column_name' ].str[ -4 : ]
- mid() 2 and 3: df[ 'column_name' ].str[ 2 : 4 ]
- http://stackoverflow.com/questions/20970279/how-to-do-a-left-right-and-mid-of-a-string-in-a-pandas-dataframe#20970328
to split a date string (MMDDYYYY) out into separate date part columns:
- If column is an integer, convert it to string:
```
  temp_series = df[ "<column_name>" ].apply( str )
  # OR slower...
  temp_series = df[ "<column_name>" ].astype( str )
```
- zero-pad to 8 digits:
```
  temp_series = temp_series.str.zfill( 8 )
```
- parse out date parts:
  - month - left()-most 2 characters: df[ 'column_name' ].str[ : 2 ]
  - day - mid() 2 and 3: df[ 'column_name' ].str[ 2 : 4 ]
  - year - right()-most 4 characters: df[ 'column_name' ].str[ -4 : ]

Performance

Process large datasets without running out of memory: https://pythonspeed.com/memory/

dask

alternative to pandas that is not so dependent on memory.
Links:
- https://dask.pydata.org/en/latest/
- http://pythondata.com/dask-large-csv-python/

pandas and large data sets

Wes McKinney - pandas creator:
- Pandas and apache arrow: http://wesmckinney.com/blog/pandas-and-apache-arrow/
- overview of issues with pandas from its creator: http://wesmckinney.com/blog/apache-arrow-pandas-internals/
- using pandas with big data from its creator: https://www.dataquest.io/blog/pandas-big-data/
- apache arrow:
  - http://wesmckinney.com/blog/arrow-columnar-abadi/
  - http://wesmckinney.com/blog/apache-arrow-pandas-internals/
Dataquest:
a few basic pandas memory hints: https://stackoverflow.com/questions/31886939/reduce-memory-usage-pandas#31888262
Not sure how potentially good this is: http://www.hilpisch.com/YH_In_Memory_Analytics_with_pandas.html/
basic tips, including dictionaries over functions for recoding and categorical variables: https://michaelinkles.wordpress.com/2016/04/17/reducing-memory-usage-in-pandas/
big data with pandas and sqlite: https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/

function cleanData()

def cleanData( data_frame_IN, column_name_IN, filter_value_IN ):

    """
    Parameters
    ----------
    - data_frame_IN : A pandas DataFrame
    - column_name_IN : Name of the column on the dataframe
    - filter_value_IN : The value that causes rows to be filtered out of data_frame_IN
           if it is present in the column named column_name_IN.

    Returns
    -------
    - cleaned_data_OUT : A Pandas DataFrame containing only rows that did not have the
                specified value in the specified column
    """

    # return reference
    cleaned_data_OUT = None

    # check to make sure that column name is in data frame's list of column names.
    if( column_name_IN not in list( data_frame_IN.columns.values ) ):

        print( "ERROR : Column you specified not present in the dataframe" )
        clean_data_OUT = None

    #-- END check to see if column is in data frame's list of column names. --#

    if filter_value_IN.upper() == "NULL":

        # keep rows where column passed is not NULL.
        cleaned_data_OUT = data_frame_IN[ pandas.notnull( data_frame_IN[ column_name_IN ] ) == True ]

    else:

        # keep rows where column passed in does not contain filter_out_value_IN.
        cleaned_data_OUT = data_frame_IN[ data_frame_IN[ column_name_IN ] != filter_value_IN ]

    #-- END check for "NULL" --#

    return cleaned_data_OUT

#-- END function cleanData() --#

print( "function cleanData() defined at \" + str( datetime.datetime.now() ) )

Dummy-coding data

pandas.get_dummies(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

With scikit-learn

train-test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Cheat Sheet

https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly