A Python 3 notebook demonstrating some pandas basics

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load the data
df = pd.read_csv('Gender Pay Gap.csv')

In [3]:
# find out what was loaded and how much memory it is using
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10508 entries, 0 to 10507
Data columns (total 22 columns):
Employer Name                                     10508 non-null object
Address                                           10508 non-null object
Postcode                                          10508 non-null object
Percent Difference in Mean Hourly Wage            10508 non-null float64
Percent Difference in Median Hourly Wage          10508 non-null float64
Percent Difference in Mean Bonus Received         10508 non-null float64
Percent Difference in Median Bonus Received       10508 non-null float64
Percentage of Males that Received a Bonus         10508 non-null float64
Percentage of Females that Received a Bonus       10508 non-null float64
Proportion of Males in Lower Quartile             10508 non-null float64
Proportion of Females in Lower Quartile           10508 non-null float64
Proportion of Males in Lower Middle Quartile      10508 non-null float64
Proportion of Fema

# Save Some Memory
Here's a technique for storing each float column in the smallest float type whose range covers its values. (The same could be done with ints.)

In [5]:
# First, print out the data types for columns that can use a smaller float data type.
# Note: this compares only the range of each column against each type, not its precision.
print('{')
for col in df.select_dtypes(include=['floating']):
    mx = df[col].max()
    mn = df[col].min()
    if mn > np.finfo(np.float16).min and mx < np.finfo(np.float16).max:
        print("'" + col + "': 'float16',")
    elif mn > np.finfo(np.float32).min and mx < np.finfo(np.float32).max:
        print("'" + col + "': 'float32',")
    elif mn > np.finfo(np.float64).min and mx < np.finfo(np.float64).max:
        print("'" + col + "': 'float64',")
print('}')

{
'Percent Difference in Mean Hourly Wage': 'float16',
'Percent Difference in Median Hourly Wage': 'float16',
'Percent Difference in Mean Bonus Received': 'float16',
'Percent Difference in Median Bonus Received': 'float16',
'Percentage of Males that Received a Bonus': 'float16',
'Percentage of Females that Received a Bonus': 'float16',
'Proportion of Males in Lower Quartile': 'float16',
'Proportion of Females in Lower Quartile': 'float16',
'Proportion of Males in Lower Middle Quartile': 'float16',
'Proportion of Females in Lower Middle Quartile': 'float16',
'Proportion of Males in Upper Middle Quartile': 'float16',
'Proportion of Females in Upper Middle Quartile': 'float16',
'Proportion of Males in Top Quartile': 'float16',
'Proportion of Females in Top Quartile': 'float16',
}
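
The note above says the same could be done with ints. Here is a minimal sketch of that idea, assuming a small made-up DataFrame (`demo`) and a hypothetical helper named `smallest_int_dtype`; neither is part of the original notebook or the pay gap file.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series):
    """Return the name of the narrowest signed int dtype that holds every value.

    Hypothetical helper for illustration only.
    """
    mn, mx = series.min(), series.max()
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= mn and mx <= info.max:
            return np.dtype(candidate).name
    return 'int64'


# Made-up data (an assumption), not taken from 'Gender Pay Gap.csv'
demo = pd.DataFrame({
    'small': [0, 100, -100],
    'medium': [0, 30000, -30000],
    'large': [0, 100000, -100000],
})

# Same idea as the float loop above, but collecting the result in a dict
int_types = {col: smallest_int_dtype(demo[col])
             for col in demo.select_dtypes(include=['integer'])}
print(int_types)
```

Unlike floats, ints lose no precision when narrowed, as long as every value fits in the range of the chosen type.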


In [6]:
# put these data type definitions in a variable
data_type = {
'Percent Difference in Mean Hourly Wage': 'float16',
'Percent Difference in Median Hourly Wage': 'float16',
'Percent Difference in Mean Bonus Received': 'float16',
'Percent Difference in Median Bonus Received': 'float16',
'Percentage of Males that Received a Bonus': 'float16',
'Percentage of Females that Received a Bonus': 'float16',
'Proportion of Males in Lower Quartile': 'float16',
'Proportion of Females in Lower Quartile': 'float16',
'Proportion of Males in Lower Middle Quartile': 'float16',
'Proportion of Females in Lower Middle Quartile': 'float16',
'Proportion of Males in Upper Middle Quartile': 'float16',
'Proportion of Females in Upper Middle Quartile': 'float16',
'Proportion of Males in Top Quartile': 'float16',
'Proportion of Females in Top Quartile': 'float16',
'Submitted After The Deadline': 'float16',
}
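
Copying the printed output into a variable by hand works, but the mapping could also be built directly as a dict. This is a rough sketch under the same range-only check, using a hypothetical function `float_dtype_map` and a made-up DataFrame `demo` rather than the real file.

```python
import numpy as np
import pandas as pd


def float_dtype_map(frame):
    """Map each float column to the narrowest float dtype whose range covers it.

    Hypothetical helper for illustration; it checks range only, not precision.
    """
    mapping = {}
    for col in frame.select_dtypes(include=['floating']):
        mn, mx = frame[col].min(), frame[col].max()
        for candidate in (np.float16, np.float32, np.float64):
            info = np.finfo(candidate)
            if info.min < mn and mx < info.max:
                mapping[col] = np.dtype(candidate).name
                break
    return mapping


# Made-up data (an assumption), not taken from 'Gender Pay Gap.csv'
demo = pd.DataFrame({
    'pct': [12.5, -3.0, 99.9],
    'big': [1.0e10, 2.0, 3.0],
    'name': ['a', 'b', 'c'],   # non-float columns are skipped
})

demo_types = float_dtype_map(demo)
print(demo_types)
```

The resulting dict has the same shape as `data_type` above, so it could be passed to the `dtype` argument of `pd.read_csv` in the same way.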

In [7]:
# read the csv file using these data type definitions
df2 = pd.read_csv('Gender Pay Gap.csv', dtype=data_type)

In [8]:
# check that it worked and that some memory was saved
df2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10508 entries, 0 to 10507
Data columns (total 22 columns):
Employer Name                                     10508 non-null object
Address                                           10508 non-null object
Postcode                                          10508 non-null object
Percent Difference in Mean Hourly Wage            10508 non-null float16
Percent Difference in Median Hourly Wage          10508 non-null float16
Percent Difference in Mean Bonus Received         10508 non-null float16
Percent Difference in Median Bonus Received       10508 non-null float16
Percentage of Males that Received a Bonus         10508 non-null float16
Percentage of Females that Received a Bonus       10508 non-null float16
Proportion of Males in Lower Quartile             10508 non-null float16
Proportion of Females in Lower Quartile           10508 non-null float16
Proportion of Males in Lower Middle Quartile      10508 non-null float16
Proportion of Fema

You should find it's using 6.4 MB rather than 7.2 MB. That's not much of a saving, but the technique can be very useful on larger data sets.
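
The saving can also be measured in code instead of being read off the `info()` output. The sketch below uses randomly generated stand-in data (an assumption, not the pay gap file). It also reports the largest rounding error, because float16 keeps only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (an assumption), not 'Gender Pay Gap.csv'
rng = np.random.default_rng(0)
demo = pd.DataFrame({'pct': rng.uniform(-100, 100, 1000)})

# Narrow the float64 column to float16
small = demo.astype({'pct': 'float16'})

# Total bytes used by each frame, index included
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f'{before} bytes -> {after} bytes')

# Largest absolute difference introduced by the narrower type
max_error = (demo['pct'] - small['pct'].astype('float64')).abs().max()
print(f'largest rounding error: {max_error:.4f}')
```

If the rounding error is larger than the analysis can tolerate, `float32` is a safer middle ground.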