### APPF3 | Spring Semester 2020

# Using Pandas to Get More out of Data
## Pandas
* Pandas is a newer package built on top of NumPy
* NumPy is very useful for numerical computing tasks
* Pandas allows more flexibility: Attaching labels to data, working with missing data, etc.

In [1]:
%autosave 30
import pandas as pd
pd.__version__

Autosaving every 30 seconds


'1.0.3'

In [2]:
import numpy as np # We will need NumPy throughout this course day
np.__version__

'1.18.4'

## The Pandas Objects
* Pandas objects are enhanced versions of NumPy arrays: The rows and columns are identified with labels rather than simple integer indices
* `Series` object: A one-dimensional array of indexed data
* `DataFrame` object: A two-dimensional array with both flexible row indices and flexible column names

## The Pandas `Series` Object
* A Pandas `Series` object is a one-dimensional array of indexed data
 * NumPy array: has an _implicitly_ defined integer index
 * By default, a `Series` object also uses integer indices:

In [None]:
data1 = pd.Series([100,200,300])
data1

* A `Series` object can have an _explicitly_ defined index associated with the values:

In [None]:
data2 = pd.Series([100,200,300], index=["a","b","c"])
data2

* We can access the index labels by using the `index` attribute:

In [None]:
d2ind = data2.index
d2ind

* A Python dictionary maps arbitrary keys to a set of arbitrary values
* A `Series` object maps _typed_ keys to a set of _typed_ values
 * "Typed" means we know the type of the indices and elements beforehand, making Pandas Series objects much more efficient than Python dictionaries for certain operations
* We can construct a `Series` object directly from a Python dictionary:

In [None]:
data_dict = pd.Series({"c":123,"a":30,"b":100})
data_dict
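
To make the "typed" idea more concrete, here is a minimal sketch (not part of the original notebook; the variable names are only illustrative) that inspects the types stored in a `Series` built from a dictionary:

```python
import pandas as pd

# A Series built from a dictionary: keys become the index, values the data
typed_series = pd.Series({"c": 123, "a": 30, "b": 100})

# All values share a single NumPy dtype (an integer type here)
print(typed_series.dtype)

# The index carries its own dtype as well
# (string labels are typically stored as dtype `object` in pandas 1.x)
print(typed_series.index.dtype)

# Label-based lookup works like a dictionary lookup
value_a = typed_series["a"]
print(value_a)
```

Because every element has the same known type, operations such as `typed_series.sum()` can run on the underlying array instead of looping over Python objects.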

## The Pandas `DataFrame` Object
* A `DataFrame` object is an analog of a two-dimensional array with both flexible row indices and flexible column names
 * Both the rows and columns have a generalized index for accessing the data
 * The row indices can be accessed by using the `index` attribute
 * The column indices can be accessed by using the `columns` attribute
 
## Constructing `DataFrame` Objects
* You can think of a `DataFrame` as a sequence of aligned `Series` objects, meaning that each column of a `DataFrame` is a `Series`

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

In [None]:
population

In [None]:
area

In [None]:
states = pd.DataFrame({'population': population,'area': area})
states

In [None]:
states.index

In [None]:
states.columns

* There are multiple ways to construct a `DataFrame` object
 * From a single `Series` object

In [None]:
pd.DataFrame(population, columns=["population"])

 * From a list of dictionaries:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

 * From a dictionary of `Series` objects:

In [None]:
pd.DataFrame({'population': population, 'area': area})

 * From a two-dimensional NumPy array:

In [None]:
rng = np.random.RandomState(0) # Ensure that the same random numbers are generated each time we run this code
pd.DataFrame(rng.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

## Data Selection in `Series`

`Series` as a dictionary: 
 * Select elements by key, e.g. `data['a']`
 * Modify the `Series` object with familiar syntax, e.g. `data['e'] = 100`
 * Check if a key exists by using the `in` operator
 * Access all the keys by using the `keys()` method
 * Access all the key/value pairs by using the `items()` method (the values alone are available through the `values` attribute)

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
list(data.items())

In [None]:
data['e'] = 1.25
data

* `Series` as one-dimensional array: 
 * Select elements by the implicit integer index, e.g. `data[0]`
 * Select elements by the explicit index, e.g. `data['a']`
 * Select slices (by using an implicit integer index or an explicit index)
   * _Important_: Slicing with an explicit index (e.g., `data['a':'c']`) will _include_ the final index in the slice, while slicing with an implicit index (e.g., `data[0:3]`) will _exclude_ the final index from the slice
 * Use masking operations, e.g., `data[data < 3]`

In [None]:
data['a':'c'] # Slicing by explicit index

In [None]:
data[0:2] # Slicing by implicit index

In [None]:
data[(data > 0.3) & (data < 0.8)] # Masking operation
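
The difference between the explicit and the implicit index matters most when the explicit labels are themselves integers. The following sketch assumes a small hypothetical `Series` named `s` and uses the `loc` and `iloc` indexers to state which index is meant:

```python
import pandas as pd

# Explicit integer labels that differ from the positions 0, 1, 2
s = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = s.loc[1]            # explicit index: the element labeled 1
by_position = s.iloc[1]        # implicit index: the second element

label_slice = s.loc[1:3]       # explicit slice: includes the final label 3
position_slice = s.iloc[1:3]   # implicit slice: excludes position 3

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Using `loc` and `iloc` avoids having to remember which of the two indices plain square brackets would pick.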

## Data Selection in `DataFrame`
* `DataFrame` as a dictionary of related `Series` objects: 
 * Select `Series` by the column name, e.g. `df['area']`
 * Modify the `DataFrame` object with familiar syntax, e.g. `df['c3'] = df['c2']/ df['c1']`

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

data = pd.DataFrame({'area': area, 'population': population})
data

In [None]:
data[['area', 'population']]

In [None]:
data['density'] = data['population'] / data['area']
data

* `DataFrame` as two-dimensional array: 
 * Access the underlying NumPy data array by using the `values` attribute
   * `df.values[0]` will select the first row
 * Use the `iloc` indexer to index, slice, and modify the data by using the implicit integer index
 * Use the `loc` indexer to index, slice, and modify the data by using the explicit index

In [None]:
data.values

In [None]:
data.values[0]

In [None]:
data['area']

In [None]:
data.iloc[:3, :2] # Use implicit indices

In [None]:
data.loc[:'Illinois', :'population'] # Use explicit indices

In [None]:
data.iloc[0, 2] = 90
data

## Ufuncs and Pandas
* Pandas is designed to work with NumPy, so any NumPy ufunc will work on Pandas `Series` and `DataFrame` objects
* _Index preservation_: Applying a ufunc returns a new Pandas object whose indices are preserved
* _Index alignment_: When an operation combines two Pandas objects, Pandas aligns their indices
 * Missing data is marked with `NaN` ("Not a Number")
 * We can specify a fill value for any elements that might be missing by using the optional keyword `fill_value`: `A.add(B, fill_value=0)`
 * We can also use the `dropna()` method to drop missing values
* _Note_: Any of the ufuncs discussed for NumPy can be used in a similar manner with Pandas objects

### Ufuncs: Index Preservation

In [3]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [4]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


In [5]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [6]:
np.sin(df * np.pi / 4)

,A,B,C,D
0,-1.0,0.7071068,1.0,-1.0
1,-0.707107,1.224647e-16,0.707107,-0.7071068
2,-0.707107,1.0,-0.707107,1.224647e-16


### Ufuncs: Index Alignment

In [7]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')


In [8]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [9]:
(population / area).dropna()

California    90.413926
Texas         38.018740
dtype: float64

In [10]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A

,A,B
0,1,11
1,5,1


In [11]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B

,B,A,C
0,4,0,9
1,5,8,0
2,9,2,6


In [12]:
A + B

,A,B,C
0,1.0,15.0,NaN
1,13.0,6.0,NaN
2,NaN,NaN,NaN


In [13]:
A.add(B)

,A,B,C
0,1.0,15.0,NaN
1,13.0,6.0,NaN
2,NaN,NaN,NaN
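
The `fill_value` keyword mentioned in the list above can be sketched as follows. The two frames are rebuilt from the values printed in the outputs above, so the example does not depend on the random number generator:

```python
import pandas as pd

# Values copied from the outputs of the two cells above
A = pd.DataFrame([[1, 11], [5, 1]], columns=list("AB"))
B = pd.DataFrame([[4, 0, 9], [5, 8, 0], [9, 2, 6]], columns=list("BAC"))

# Plain addition: every position missing in A becomes NaN
plain_sum = A + B

# With fill_value=0, a value missing in only one frame is treated as 0
filled_sum = A.add(B, fill_value=0)

print(plain_sum.isna().sum().sum())  # number of NaN entries in A + B
print(filled_sum)
```

A position would still be `NaN` if it were missing in both frames; that case does not occur here because `B` covers every row and column.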


### Ufuncs: Operations Between `DataFrame` and `Series`
* Operations between a `DataFrame` and a `Series` are similar to operations between a two-dimensional and one-dimensional NumPy array (e.g., compute the difference of a two-dimensional array and one of its rows)

In [14]:
rng = np.random.RandomState(2)
A = rng.randint(10, size=(3, 4))
A

array([[8, 8, 6, 2],
       [8, 7, 2, 1],
       [5, 4, 4, 5]])

In [15]:
A - A[0] # Subtract the first row of A from A itself

array([[ 0,  0,  0,  0],
       [ 0, -1, -4, -1],
       [-3, -4, -2,  3]])

In [16]:
B = rng.randint(10, size=(3, 4))
B

array([[7, 3, 6, 4],
       [3, 7, 6, 1],
       [3, 5, 8, 4]])

In [17]:
df = pd.DataFrame(B, columns=list('QRST'))
df

,Q,R,S,T
0,7,3,6,4
1,3,7,6,1
2,3,5,8,4


In [18]:
df - df.iloc[0]

,Q,R,S,T
0,0,0,0,0
1,-4,4,0,-3
2,-4,2,2,0


In [19]:
df.subtract(df['R'], axis=0) # We can also subtract column-wise

,Q,R,S,T
0,4,0,3,1
1,-4,0,-1,-6
2,-2,0,3,-1


## Reading (and Writing) Data with Pandas
### File Types
* We will work with _plaintext files_ only in this session; these contain only basic text characters and do not include font, size, or color information
 * _Binary files_ are all other file types, such as PDFs, images, executable programs etc.
 
### The Current Working Directory
* Every program that runs on your computer has a _current working directory_
 * It's the directory from where the program is executed / run
 * _Folder_ is the more modern name for a directory
* The _root_ directory is the top-most directory and is addressed by `/` 
 * A directory `mydir1` in the root directory can be addressed by `/mydir1`
 * A directory `mydir2` within the `mydir1` directory can be addressed by `/mydir1/mydir2`, and so on
 
### Absolute and Relative Paths
* An _absolute path_ always begins with the root folder, e.g. `/my/path/...`
* A _relative path_ is always relative to the program's current working directory
 * If a program's current working directory is `/myprogram` and the directory contains a folder `files` with a file `test.txt`, then the relative path to that file is just `files/test.txt` 
 * The absolute path to `test.txt` would be `/myprogram/files/test.txt` (note the root folder `/`)

In [None]:
!ls # List folder content for current working directory

In [None]:
!pwd # Print path to the current working directory
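
The same information is available from Python itself, which also works outside a notebook. This is a minimal sketch using the standard library; the path `files/test.txt` is the hypothetical example from the text and does not need to exist:

```python
import os
from pathlib import Path

cwd = os.getcwd()          # Path to the current working directory (like `!pwd`)
entries = os.listdir(cwd)  # Names in that directory (similar to `!ls`)

relative = Path("files") / "test.txt"  # Hypothetical relative path
absolute = Path(cwd) / relative        # Joined onto the current working directory

print(cwd)
print(relative.is_absolute(), absolute.is_absolute())
```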

### Reading Data with Pandas
* Pandas provides the `pandas.read_csv()` function to load data from a CSV file (or a file that uses a different delimiter than a comma)
 * The path you specify doesn't have to be on your hard disk; you can also provide the URL to a CSV file to read it directly into a Pandas object
 * We can set the optional argument `error_bad_lines` to `False` so that bad lines in the file are skipped instead of causing an error (pandas 1.3 and later replace this argument with `on_bad_lines='skip'`)
 * Check out the documentation to learn more about the optional arguments:<br>https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
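
As a small illustration of reading a file with a different delimiter, the sketch below parses semicolon-separated text. The data is hypothetical and held in memory with `io.StringIO`, so no file on disk is assumed:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, kept in memory instead of a file
raw_text = "name;value\nalpha;1\nbeta;2\n"

# `sep` tells read_csv which delimiter to use (the default is a comma)
table = pd.read_csv(io.StringIO(raw_text), sep=";")
print(table)
```

With a real file, the first argument would simply be the path or URL instead of the `StringIO` object.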
 
**Planets Data**: The _Planets_ dataset (available from the Seaborn package or the [`seaborn-data` repository](https://github.com/mwaskom/seaborn-data)) gives information on planets that astronomers have discovered around other stars (known as extrasolar planets or exoplanets for short). The file contains details on the 1000+ exoplanets discovered up to 2014.


In [20]:
planets = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv")

In [21]:
planets.shape

(1035, 6)

In [22]:
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [23]:
planets.dropna().describe()

,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0
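
Writing works the other way round with the `to_csv()` method of a `DataFrame`. The following sketch uses a small hypothetical table (not taken from the Planets data) and writes to an in-memory buffer; passing a file path instead would create a file:

```python
import io
import pandas as pd

# A small hypothetical table
small = pd.DataFrame({"method": ["Transit", "Imaging"], "year": [2012, 2009]})

# index=False leaves out the row labels, so only the two columns are written
buffer = io.StringIO()
small.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the same values
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```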


### Some Interesting Data Sources
* Federal Statistical Office: https://www.bfs.admin.ch/bfs/en/home/statistics/catalogues-databases/data.html 
* OpenData: https://opendata.swiss/en/ 
* United Nations: http://data.un.org/ 
* World Health Organization: http://apps.who.int/gho/data/node.home 
* World Bank: https://data.worldbank.org/ 
* Kaggle: https://www.kaggle.com/datasets 
* CERN: http://opendata.cern.ch/
* NASA: https://data.nasa.gov/ 
* FiveThirtyEight: https://github.com/fivethirtyeight/data 

## Aggregating and Grouping Data in Pandas

### Simple Aggregation in Pandas
* As with a one-dimensional NumPy array, the aggregates of a Pandas `Series` return a single value
* For a `DataFrame`, the aggregates are by default computed within each column
* Pandas `Series` and `DataFrame` objects include all of the common NumPy aggregates
 * In addition, there is a convenience method `describe()` that computes several common aggregates for each column and returns the result

In [24]:
rng = np.random.RandomState(3)
ser = pd.Series(rng.rand(5))
ser

0    0.550798
1    0.708148
2    0.290905
3    0.510828
4    0.892947
dtype: float64

In [25]:
ser.sum()

2.9536250236509423

In [26]:
ser.mean()

0.5907250047301884

In [27]:
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
df

,A,B
0,0.896293,0.029876
1,0.125585,0.456833
2,0.207243,0.649144
3,0.051467,0.278487
4,0.44081,0.676255


In [28]:
df.mean()

A    0.344280
B    0.418119
dtype: float64

In [29]:
df.mean(axis='columns')

0    0.463085
1    0.291209
2    0.428193
3    0.164977
4    0.558532
dtype: float64

In [30]:
df.describe()

,A,B
count,5.0,5.0
mean,0.34428,0.418119
std,0.341461,0.270062
min,0.051467,0.029876
25%,0.125585,0.278487
50%,0.207243,0.456833
75%,0.44081,0.649144
max,0.896293,0.676255


### Split, Apply, Combine
* _Split_: Break up and group a DataFrame depending on the value of the specified key
* _Apply_: Apply some function, usually an aggregate, transformation, or filtering, within the individual groups
* _Combine_: Merge the results of these operations into an output array
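
The three steps can be spelled out by hand on a small toy table. This is only a sketch to show what happens conceptually; the names `pieces`, `sums`, and `manual_result` are illustrative, and in practice `groupby()` does all of this in one call:

```python
import pandas as pd

df_demo = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                        "data": range(1, 7)})

# Split: one sub-DataFrame per distinct key
pieces = {key: df_demo[df_demo["key"] == key] for key in df_demo["key"].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece["data"].sum() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual_result = pd.Series(sums, name="data")

# groupby performs the same three steps in one call
groupby_result = df_demo.groupby("key")["data"].sum()

print(manual_result)
print(groupby_result)
```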

### The `GroupBy` Object
* The `groupby()` method returns a `DataFrameGroupBy` object: It's a special view of the `DataFrame`
 * Helps get information about the groups, but does no actual computation until the aggregation is applied ("lazy evaluation", i.e. evaluate only when needed)
 * Apply an aggregate to this `DataFrameGroupBy` object: This will perform the appropriate apply/combine steps to produce the desired result
   * You can apply any Pandas or NumPy aggregation function
 * Other important operations made available by a `GroupBy` are _filter_, _transform_, and _apply_


In [31]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(1,7)})
df

,key,data
0,A,1
1,B,2
2,C,3
3,A,4
4,B,5
5,C,6


In [32]:
groupby_key = df.groupby('key')
groupby_key.groups

{'A': Int64Index([0, 3], dtype='int64'),
 'B': Int64Index([1, 4], dtype='int64'),
 'C': Int64Index([2, 5], dtype='int64')}

In [33]:
df.groupby('key').sum()

key,data
A,5
B,7
C,9


### Column Indexing and Iterating Over Groups
* The `GroupBy` object supports column indexing in the same way as the `DataFrame`, and returns a modified `GroupBy` object
* The `GroupBy` object also supports direct iteration over the groups, returning each group as a `Series` or `DataFrame`

In [42]:
# Display groups 
planets.groupby('method').groups.keys()

dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])

In [54]:
# Which is the most recent method?
planets.groupby('method')['year'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,NaN,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


In [36]:
# Median orbital period per detection method
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [None]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

### Aggregate, Filter, Transform, and Apply
* _Aggregate_: The `aggregate()` method can compute multiple aggregates at once
* _Filter_: The `filter()` method allows you to drop data based on group properties
 * _Note_: `filter()` takes as an argument a function that returns a Boolean value specifying whether the group passes the filtering
* _Transform_: While aggregation must return a reduced version of the data, `transform()` can return some transformed version of the full data to recombine (meaning that we still have the same number of entries before and after the transformation)
* _Apply_: The `apply()` method lets you apply an arbitrary function to the group results (or even to `DataFrame`s in general). The arbitrary function should take a `DataFrame`, and return either a Pandas object or a scalar

In [56]:
rng = np.random.RandomState(4)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6),
                   'price': ['$ 123.00', '$ 112.00', '$ 123.00', '$ 12.32', '$ 14.32', '$ 0.123']})
df

,key,data1,data2,price
0,A,0,7,$ 123.00
1,B,1,5,$ 112.00
2,C,2,1,$ 123.00
3,A,3,8,$ 12.32
4,B,4,7,$ 14.32
5,C,5,8,$ 0.123


In [57]:
# Multiple aggregates per column (restricted to the numeric columns)
df.groupby('key')[['data1', 'data2']].aggregate(['min', 'median', 'max'])

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,7,7.5,8
B,1,2.5,4,5,6.0,7
C,2,3.5,5,1,4.5,8
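
`aggregate()` also accepts a dictionary that maps column names to the aggregate to compute for each column. The sketch below rebuilds the numeric part of the table from the values printed above (the name `df_agg` is only illustrative):

```python
import pandas as pd

# Values copied from the output above, so no random numbers are needed
df_agg = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                       "data1": range(6),
                       "data2": [7, 5, 1, 8, 7, 8]})

# Map each column name to the aggregate that should be applied to it
per_column = df_agg.groupby("key").aggregate({"data1": "min", "data2": "max"})
print(per_column)
```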


In [73]:
# Keep all groups for which the standard deviation of data2 is greater than 4
def filter_func(x):
    # x is a DataFrame of group values
    return x['data2'].std() > 4

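In [71]:
# The price column appears as floats below because the cleanup_price cell
# (defined later in this notebook) was executed before this one
display(df, df.groupby('key').std())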

,key,data1,data2,price
0,A,0,7,123.0
1,B,1,5,112.0
2,C,2,1,123.0
3,A,3,8,12.32
4,B,4,7,14.32
5,C,5,8,0.123


key,data1,data2,price
A,2.12132,0.707107,78.262579
B,2.12132,1.414214,69.07019
C,2.12132,4.949747,86.88716


In [74]:
# Filtering: Select data based on group properties
df.groupby('key').filter(filter_func) 

,key,data1,data2,price
2,C,2,1,123.0
5,C,5,8,0.123


In [75]:
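def norm_by_data2(x):
    # x is a DataFrame of group values
    x = x.copy()  # Work on a copy so the group's original data is not modified
    # Divide data1 by the sum of data2 within the group
    x["data1"] = x["data1"] / x["data2"].sum()
    return x

display(df, df.groupby("key").apply(norm_by_data2))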

,key,data1,data2,price
0,A,0,7,123.0
1,B,1,5,112.0
2,C,2,1,123.0
3,A,3,8,12.32
4,B,4,7,14.32
5,C,5,8,0.123


,key,data1,data2,price
0,A,0.0,7,123.0
1,B,0.083333,5,112.0
2,C,0.222222,1,123.0
3,A,0.2,8,12.32
4,B,0.333333,7,14.32
5,C,0.555556,8,0.123
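
For comparison, here is a minimal sketch of `transform()` on the same numeric values (copied from the output above; `df_tr` is an illustrative name). It centers each column by subtracting the mean of the row's group, so the result keeps one row per input row:

```python
import pandas as pd

df_tr = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                      "data1": range(6),
                      "data2": [7, 5, 1, 8, 7, 8]})

# Aggregation: one row per group
group_means = df_tr.groupby("key")[["data1", "data2"]].mean()

# Transformation: same number of rows as the input;
# each value has the mean of its own group subtracted
centered = df_tr.groupby("key")[["data1", "data2"]].transform(lambda x: x - x.mean())

print(group_means.shape, centered.shape)
print(centered)
```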


In [76]:
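def cleanup_price(x):
    # Convert a price string such as '$ 123.00' to a float;
    # values that are already numeric are returned unchanged,
    # so the cell can safely be run more than once
    if isinstance(x, str):
        return float(x[2:])
    return x

df["price"] = df["price"].apply(cleanup_price)
df

* _Note_: Without the `isinstance` check, running the cell a second time raises `TypeError: 'float' object is not subscriptable`, because `price` already contains floats after the first run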

### Transform: an Example Based on Sales Data
Source: http://pbpython.com/pandas_transform.html

In [77]:
sales = pd.read_csv("datasets/sales_transactions.csv")

In [78]:
# An order consists of multiple products (SKUs)
sales.head()

,account,name,order,sku,quantity,unit price
0,383080,Will LLC,10001,B1-20000,7,33.69
1,383080,Will LLC,10001,S1-27722,11,21.12
2,383080,Will LLC,10001,B1-86481,3,35.99
3,412290,Jerde-Hilpert,10005,S1-06532,48,55.82
4,412290,Jerde-Hilpert,10005,S1-82801,21,13.62


In [79]:
sales["cost"] = sales["quantity"] * sales["unit price"]

In [80]:
sales.head()

,account,name,order,sku,quantity,unit price,cost
0,383080,Will LLC,10001,B1-20000,7,33.69,235.83
1,383080,Will LLC,10001,S1-27722,11,21.12,232.32
2,383080,Will LLC,10001,B1-86481,3,35.99,107.97
3,412290,Jerde-Hilpert,10005,S1-06532,48,55.82,2679.36
4,412290,Jerde-Hilpert,10005,S1-82801,21,13.62,286.02


In [81]:
# Goal: Compute how much each product's cost contributes to its order's total
groupby_order = sales.groupby('order')
# transform returns one value per row, so the result lines up with the original rows
sales["order total"] = groupby_order["cost"].transform("sum")

In [82]:
sales.head()

,account,name,order,sku,quantity,unit price,cost,order total
0,383080,Will LLC,10001,B1-20000,7,33.69,235.83,576.12
1,383080,Will LLC,10001,S1-27722,11,21.12,232.32,576.12
2,383080,Will LLC,10001,B1-86481,3,35.99,107.97,576.12
3,412290,Jerde-Hilpert,10005,S1-06532,48,55.82,2679.36,8185.49
4,412290,Jerde-Hilpert,10005,S1-82801,21,13.62,286.02,8185.49


In [83]:
sales["percentage"] = sales["cost"] / sales["order total"] * 100

In [84]:
sales.head()

,account,name,order,sku,quantity,unit price,cost,order total,percentage
0,383080,Will LLC,10001,B1-20000,7,33.69,235.83,576.12,40.93418
1,383080,Will LLC,10001,S1-27722,11,21.12,232.32,576.12,40.324932
2,383080,Will LLC,10001,B1-86481,3,35.99,107.97,576.12,18.740887
3,412290,Jerde-Hilpert,10005,S1-06532,48,55.82,2679.36,8185.49,32.733043
4,412290,Jerde-Hilpert,10005,S1-82801,21,13.62,286.02,8185.49,3.494232
