# Compare Stripe Vs CM

#### Specifications

**Stripe Vs CM**

Input files
1. Cleaned Stripe file
2. Segregated CM file

Insights on Input data
1. Stripe raw data has no duplicate ChargeID entries. Each charge appears once, with its total in the 'gross' column.
2. CM data can contain multiple transactions with the same ChargeID.
3. In the CM data, the same ChargeID sometimes appears in both the Monthly and the Yearly file. After the two files are combined, that ChargeID has two records with different Plan Types.
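
The third insight can be checked before any grouping is done. The following is a minimal sketch on invented toy data, not part of the audit script; it assumes the combined CM data has the 'Charge ID' and 'Plan Type' columns used later in this notebook, and the names `cm`, `plan_counts` and `mixed_plan_ids` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the combined CM data; all values are invented for illustration.
cm = pd.DataFrame({
    "Charge ID": ["ch_1", "ch_1", "ch_2", "ch_3", "ch_3"],
    "Plan Type": ["Monthly", "Yearly", "Monthly", "Yearly", "Yearly"],
    "Line Item Value Account Currency": [10.0, 5.0, 20.0, 30.0, 15.0],
})

# Count the distinct plan types per Charge ID. More than one means the charge
# was present in both the Monthly and the Yearly file.
plan_counts = cm.groupby("Charge ID")["Plan Type"].nunique()
mixed_plan_ids = plan_counts[plan_counts > 1].index.tolist()

print(mixed_plan_ids)
```

Listing these ChargeIDs up front makes it easier to review them by hand, since a 'first' aggregation on 'Plan Type' would silently keep only one of the two plan types.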

Comparison
Step - 01
Group the CM data by ChargeID, summing "Line Item Value Account Currency", then compare each sum with Stripe's 'gross' value for the same charge.
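
The comparison step can be sketched on toy data as follows. This is an illustration under assumptions, not the audit script itself: the values are invented, and only the column names ('source_id', 'gross', 'Charge ID', 'Line Item Value Account Currency') are taken from the script in this notebook.

```python
import pandas as pd

# Toy Stripe data: one row per charge, total already in 'gross' (invented values).
stripe = pd.DataFrame({
    "source_id": ["ch_1", "ch_2", "ch_4"],
    "gross": [15.0, 20.0, 7.0],
})

# Toy CM data: several line items may share one Charge ID (invented values).
cm = pd.DataFrame({
    "Charge ID": ["ch_1", "ch_1", "ch_2", "ch_3"],
    "Line Item Value Account Currency": [10.0, 5.0, 18.0, 30.0],
})

# Sum the CM line items per Charge ID.
cm_grouped = cm.groupby("Charge ID", as_index=False)[
    "Line Item Value Account Currency"
].sum()

# Left merge keeps every Stripe charge, matched or not.
merged = stripe.merge(cm_grouped, left_on="source_id", right_on="Charge ID", how="left")

# Zero means the two systems agree; NaN means the charge is missing from CM.
merged["Difference"] = merged["gross"] - merged["Line Item Value Account Currency"]

print(merged[["source_id", "gross", "Line Item Value Account Currency", "Difference"]])
```

A left merge is used here because Stripe is treated as the reference side: charges that exist only in CM (such as `ch_3` in the toy data) drop out, while Stripe charges with no CM match stay visible as NaN.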


### Script
---

#### Imports, preparations and functions

In [11]:
import pandas as pd
import os
!pip install XlsxWriter
import xlsxwriter


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


###### Preparations
---


In [12]:
#Mount GDrive
from google.colab import drive
drive.mount('/content/gdrive')

#Set file paths
path = "/content/gdrive/My Drive/Data Science 2022/CM Audit/"

filestripe = path + "Stripe Data/Stripe Raw Data - 2022.09.xlsx"
filecm = path + "Monthly Output/2022.09/CM_2022.09.xlsx"

outfile = path + "Monthly Output/2022.09/Comparison/StripeVsCM_202209.xlsx"

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


#### Compare Stripe Vs CM 


In [18]:
lstType = ['dispute','dispute_reversal','fee']

#Load file (CSV or Excel, chosen by file extension)
if filestripe.endswith(".csv"):
  month = filestripe[-9:-4]   #five characters before ".csv"
  df = pd.read_csv(filestripe, na_filter=False, index_col=False)
else:
  month = filestripe[-10:-5]  #five characters before ".xlsx"
  df = pd.read_excel(filestripe, na_filter=False, index_col=False)

############ Segregate non-revenue transactions based on 'reporting_category'
#Move the non-revenue 'reporting_category' rows to a separate sheet ("Non Revenue records")

#.copy() makes dfmove an independent frame, so insert() does not act on a slice of df
dfmove = df[df['reporting_category'].isin(lstType)].copy() #rows moved to the "Non Revenue records" sheet
dfmove.insert(dfmove.shape[1],"Plan Type","")
dfclean = df[~df['reporting_category'].isin(lstType)] #clean dataset

dfs = dfclean.copy()

dfcm0 = pd.read_excel(filecm,sheet_name="Stripe" , na_filter=False, index_col=False)

#Group CM data by 'Charge ID' and sum up 'Line Item Value Account Currency'.
#'first' keeps the first Customer External ID and Plan Type seen for each Charge ID.
dfcm = dfcm0.groupby(['Charge ID'], as_index = False).agg({
    'Line Item Value Account Currency': 'sum',
    'Customer External ID': 'first',
    'Plan Type': 'first',
})

dfmon = pd.merge(dfs, dfcm, right_on="Charge ID", left_on="source_id", how='left')

#Keep only stripe data along with plan
cmcols = dfcm.columns.tolist()
cmcols.remove('Plan Type')
cmcols.remove('Line Item Value Account Currency')

mcols = dfmon.columns.tolist()
newcols = [x for x in mcols if x not in cmcols]
dfnews = dfmon[newcols].copy() #independent copy, so later edits do not act on a slice of dfmon

#Delete duplicate rows
dfnews = dfnews.drop_duplicates()

'''
#Look for duplicate ChargeIDs. 
lstduplicate = dfnews[dfnews.duplicated('source_id')].source_id.to_list()
#For those chargeIDs find corresponding CM record with "Line Item Type" = 'subscription'. Get 'plan type' from that transaction
#Get all records from CM that has chargeid = dfduplicate['source_id']
dfcm_match = dfcm[dfcm['Charge ID'].isin(lstduplicate)]
dfcm_subs = dfcm_match[dfcm_match["Line Item Type"] == 'subscription']
#Iterate cm data to find plan type
lsthit=[]
for i,row in dfcm_subs.iterrows():
  for item in lstduplicate:
    if row['Charge ID'] == item:
      dfnews.loc[dfnews['source_id'] == item, ['Plan Type']] = row['Plan Type']
      dfnews.loc[dfnews['source_id'] == item, ['Line Item Value Account Currency']] = row['Line Item Value Account Currency']
      lsthit.append(item)
#There are records with same customer id repeated twice. Sort out those
lstcus_duplicates =[ x for x in lstduplicate  if x not in lsthit  ]

#Remove those records from dfnews where chargeID is not in lsthit
for item in lstcus_duplicates:
  dfnews.drop(dfnews.index[(dfnews['source_id'] == item) & (dfnews['Plan Type'] == 'Yearly')],inplace=True)

dfnews.drop_duplicates(inplace=True)
'''

#Rearrange columns so that 'gross' and 'Line Item Value Account Currency' appear last.
tail_cols = ['gross', 'Line Item Value Account Currency']
dfnews = dfnews[[col for col in dfnews.columns if col not in tail_cols] + tail_cols]

#Add difference column
diff = dfnews['gross'] - dfnews['Line Item Value Account Currency']
dfnews['Difference'] = diff

dfnews.rename(columns={'Line Item Value Account Currency':'From CM data'}, inplace=True)

#Sort by plan type
dfnews = dfnews.sort_values('Plan Type',ascending=False)

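#Write both sheets. The context manager saves and closes the file once on exit,
#which avoids calling close() on an already closed file.
with pd.ExcelWriter(outfile, engine='xlsxwriter') as writer:
  dfnews.to_excel(writer, sheet_name="Stripe Vs CM", index=False)
  dfmove.to_excel(writer, sheet_name="Non Revenue records", index=False)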

  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
  warn("Calling close() on already closed file.")
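
The "A value is trying to be set on a copy of a slice" message in the output above is pandas' SettingWithCopyWarning. The output does not show which statement raised it; a likely cause, assumed here, is modifying a frame that was produced by filtering another frame. A minimal sketch of the usual remedy, taking an explicit `.copy()` before modifying, on invented toy data:

```python
import pandas as pd

# Toy frame standing in for the Stripe data (invented values).
df = pd.DataFrame({
    "reporting_category": ["charge", "fee", "charge"],
    "gross": [10.0, -1.0, 5.0],
})

# Filtering returns a new object that pandas may treat as a slice of df.
# An explicit .copy() makes it an independent frame, so adding a column
# afterwards is unambiguous and does not trigger the warning.
dfclean = df[~df["reporting_category"].isin(["fee"])].copy()
dfclean["Difference"] = dfclean["gross"] - 10.0

print(dfclean)
```

The original frame `df` is left untouched, which is also what the comparison needs: the non-revenue rows and the clean rows are derived from the same unmodified source.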
