## Challenge #22: Identify Values to Aggregate

Use case: The log files from an ATM have the transaction amounts embedded in text strings. The user needs these amounts summed for each row (transaction).

Objective: For this assignment, the numbers directly following the text ‘ATM2.’ are dollar amounts for transactions. Sum the values row by row.

Original challenge: https://community.alteryx.com/t5/Weekly-Challenge/Challenge-22-Identify-Values-to-Aggregate/td-p/36751

In [390]:
import pandas as pd
pd.options.display.max_colwidth = 200

In [391]:
df = pd.read_csv('./challenge_022_input.csv')
df.head()

,Field_1
0,v3/ato.495366625/[atm1.1/atm2.39.14]/atc1.CC-270957white/atc2.156309952/[atm1.1/atm2.32.50]/atc1.CC-264289black dots/atc2.156309952/[atm1.1/atm2.19.99]/atc1.CC-286881teal splash/atc2.156309952
1,v3/ato.495846781/[atm1.1/atm2.188]/atc1.CC-289105black/atc2.128497236
2,v3/ato.495554956/[atm1.1/atm2.14.99]/atc1.CC-269604golden leopard/atc2.152224956/[atm1.1/atm2.19.99]/atc1.CC-269603golden leopard/atc2.152224956
3,v3/ato.495716117/[atm1.1/atm2.12]/atc1.CC-286474/atc2.88628621/[atm1.1/atm2.0]/atc1.CC-258242light buff/atc2.88628621/[atm1.1/atm2.3.99]/atc1.CC-272553/atc2.88628621/[atm1.1/atm2.9.99]/atc1.CC-276...
4,v3/ato.496103393/[atm1.1/atm2.5.20]/atc1.CC-259578black/atc2.174643950/[atm1.1/atm2.5.20]/atc1.CC-259578white/atc2.174643950/[atm1.1/atm2.5.20]/atc1.CC-259578buff/atc2.174643950/[atm1.1/atm2.5.20]...


In [392]:
# Extract every value between 'atm2.' and the next ']' (the literal dot and bracket are escaped)

df = df['Field_1'].str.extractall(r'atm2\.(.*?)\]')
df.head()

,match,0
0,0,39.14
0,1,32.5
0,2,19.99
1,0,188.0
2,0,14.99
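
As a possible alternative to matching everything up to the closing bracket, the pattern can capture only the numeric part that follows `atm2.`. The sketch below is a minimal illustration, not the notebook's method: the first two strings are copied from the preview above, while the third is a hypothetical stand-in for a row whose amount is not directly followed by `]`.

```python
import pandas as pd

sample = pd.Series([
    'v3/ato.495846781/[atm1.1/atm2.188]/atc1.CC-289105black/atc2.128497236',
    'v3/ato.495554956/[atm1.1/atm2.14.99]/atc1.CC-269604golden leopard/atc2.152224956'
    '/[atm1.1/atm2.19.99]/atc1.CC-269603golden leopard/atc2.152224956',
    # Hypothetical malformed row (assumption): no ']' right after the amount
    'v3/ato.1/[atm1.1/atm2.3.989999771118164/atc1.WD-26183326NM/atc2.173528165',
])

# Capture digits with an optional decimal part, so trailing text is never included
amounts = sample.str.extractall(r'atm2\.(\d+(?:\.\d+)?)')[0]
print(amounts)
```

With a pattern like this, every captured string is numeric and can be converted to `float64` without special-casing individual rows.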


In [393]:
# This step was added after converting column 0 to float64 failed. Row 216 doesn't follow the usual pattern
# and must be handled differently

df.loc[216]

match,0
0,3.989999771118164/atc1.WD-26183326NM/atc2.173528165
1,3.989999771118164/atc1.WD-26183325AM/atc2.173528165
2,3.989999771118164/atc1.WD-282384B59M/atc2.173528165
3,3.989999771118164/atc1.WD-282388BC9M/atc2.173528165
4,3.989999771118164/atc1.WD-26183325AM/atc2.173528165


In [394]:
# The first 4 characters hold the dollar value (this truncates 3.9899... to 3.98)

mask = df.index.get_level_values(0) == 216
df.loc[mask, 0] = df.loc[mask, 0].str[:4]
df.loc[216]

match,0
0,3.98
1,3.98
2,3.98
3,3.98
4,3.98
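
Slicing the first four characters truncates the value, so 3.9899... becomes 3.98. If the intended amount is 3.99 (an assumption, since the value looks like 3.99 stored with reduced floating-point precision), rounding the parsed number is a possible alternative. A minimal sketch using the numeric part of the strings shown above:

```python
import pandas as pd

# Numeric part of the five values shown for row 216
raw = pd.Series(['3.989999771118164'] * 5)

# Approach used in the notebook: keep the first 4 characters
truncated = raw.str[:4].astype('float64')

# Alternative: parse the full number, then round to cents
rounded = raw.astype('float64').round(2)

print(truncated.iloc[0], rounded.iloc[0])
```

The two approaches differ by one cent per value, which changes the total for this row.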


In [395]:
# We use the index to group our records, e.g. all values where level_0 equals 0 came from row 0 of the original dataset

df.reset_index(level=0, inplace=True)
df.head()

match,level_0,0
0,0,39.14
1,0,32.5
2,0,19.99
0,1,188.0
0,2,14.99


In [396]:
# The column label 0 is an integer, not the string '0'

df.columns

Index(['level_0', 0], dtype='object')

In [397]:
# We use the level_0 column as the index, as it contains the correct grouping for our rows

df.set_index(keys='level_0', inplace=True)
df.head(10)

level_0,0
0,39.14
0,32.5
0,19.99
1,188.0
2,14.99
2,19.99
3,12.0
3,0.0
3,3.99
3,9.99


In [398]:
# Column 0 is transformed into float64

df[0] = df[0].astype('float64')

In [399]:
# All values belonging to the same index are summed up.
# (sum(level=...) was removed in pandas 2.0, so we group on the index level instead)

df = df[0].groupby(level='level_0').sum()
df

level_0
0       91.63
1      188.00
2       34.98
3       33.96
4       76.00
        ...  
362    284.50
363    159.48
364     42.00
365     49.98
366     42.99
Name: 0, Length: 367, dtype: float64
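
The extraction, conversion, and per-row sum up to this point can also be sketched as a single chain. This is a condensed illustration rather than the notebook's own code: it assumes a numeric-only capture pattern and uses only the three complete rows visible in the input preview.

```python
import pandas as pd

# First three rows of Field_1, copied from the input preview
sample = pd.DataFrame({'Field_1': [
    'v3/ato.495366625/[atm1.1/atm2.39.14]/atc1.CC-270957white/atc2.156309952'
    '/[atm1.1/atm2.32.50]/atc1.CC-264289black dots/atc2.156309952'
    '/[atm1.1/atm2.19.99]/atc1.CC-286881teal splash/atc2.156309952',
    'v3/ato.495846781/[atm1.1/atm2.188]/atc1.CC-289105black/atc2.128497236',
    'v3/ato.495554956/[atm1.1/atm2.14.99]/atc1.CC-269604golden leopard/atc2.152224956'
    '/[atm1.1/atm2.19.99]/atc1.CC-269603golden leopard/atc2.152224956',
]})

totals = (
    sample['Field_1']
    .str.extractall(r'atm2\.(\d+(?:\.\d+)?)')[0]  # one row per match
    .astype('float64')                            # strings to numbers
    .groupby(level=0)                             # level 0 = original row number
    .sum()
)
print(totals.round(2))
```

Rows without any match do not appear in the result of `extractall`; if such rows can occur, `totals.reindex(sample.index, fill_value=0)` is one way to restore them.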

In [400]:
# The series is transformed back into a DataFrame

df = df.to_frame()
df.head()

level_0,0
0,91.63
1,188.0
2,34.98
3,33.96
4,76.0


In [401]:
# The index is reset and column 0 is renamed

df.reset_index(inplace=True)
df.rename(columns={0: 'Sum Dollar Amount'}, inplace=True)
df.head()

,level_0,Sum Dollar Amount
0,0,91.63
1,1,188.0
2,2,34.98
3,3,33.96
4,4,76.0


In [402]:
# We drop column level_0 as the index already identifies each row

df.drop(labels='level_0', axis=1, inplace=True)
df

,Sum Dollar Amount
0,91.63
1,188.00
2,34.98
3,33.96
4,76.00
...,...
362,284.50
363,159.48
364,42.00
365,49.98
