# Transition to FracFocus version 4

In Dec 2023, FracFocus changed the format (and to a small degree the content) of the bulk download.

These changes require corresponding changes in Open-FF.  The following code bridges the old and new formats and describes the process.

## Transforming the 'upload_dates' file
One large change in FF was to stop using `UploadKey` as the primary key for disclosures (it had been stable for at least 5 years) and replace it with a field named `DisclosureId`.  Although the two fields have the same format, their values differ for a given disclosure.  Because no bulk download has both fields, we must create a map of old to new keys if we are to continue to use the archived data (now 5 years of it).

In addition, crossing the transition can be tricky if we want to be able to identify "new" disclosures.  The plan to accomplish this is:
1. use the new archive (Dec 4, 2023) to create a new baseline upload_dates file.  
1. take the old archive (Nov 25, date of last full repo) and remove duplicates (but keep one - it will be connected to the `DisclosureId`) to make `olddf`.  Those duplicates will be based on: ['APINumber','JobEndDate','JobStartDate','OperatorName','TotalBaseWaterVolume']
1. use `olddf` to connect `UploadKey`s to the `DisclosureId`.  This should keep all `DisclosureId`s, even the duplicates (though they will point to the same `UploadKey`).
1. use the old repo to populate the publish date in the new baseline upload_dates file.
1. I probably want to add any `UploadKey`s that are not in the new baseline to make it backward compatible. `DisclosureId` will be NaN.
1. Remove the disclosures in the baseline that have ONLY a `DisclosureId` - these are the legitimately new disclosures and will be added in the next build.

These changes will be made locally since they only need to be applied once.  I won't pollute the GitHub space with these data files.
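
The plan above can be sketched on toy data before running it on the real archives. This is only an illustration: the two small frames and their values are invented, and the names `KEY_COLS`, `old_nodup`, `key_map` and `baseline` are hypothetical. It shows how duplicated `DisclosureId`s end up pointing at the same `UploadKey`, and how disclosures found only in the new archive are set aside.

```python
import pandas as pd

# Columns used to match a disclosure across the two formats (from the plan).
KEY_COLS = ['APINumber', 'JobEndDate', 'JobStartDate',
            'OperatorName', 'TotalBaseWaterVolume']

# Toy "old" archive (invented values): u1 and u2 duplicate each other.
old = pd.DataFrame({
    'UploadKey': ['u1', 'u2', 'u3'],
    'APINumber': ['A', 'A', 'B'],
    'JobEndDate': ['2020-01-02', '2020-01-02', '2021-05-06'],
    'JobStartDate': ['2020-01-01', '2020-01-01', '2021-05-05'],
    'OperatorName': ['op1', 'op1', 'op2'],
    'TotalBaseWaterVolume': [100.0, 100.0, 200.0],
})

# Toy "new" archive: d1 and d2 are the same duplicated disclosure, d4 is new.
new = pd.DataFrame({
    'DisclosureId': ['d1', 'd2', 'd3', 'd4'],
    'APINumber': ['A', 'A', 'B', 'C'],
    'JobEndDate': ['2020-01-02', '2020-01-02', '2021-05-06', '2023-12-01'],
    'JobStartDate': ['2020-01-01', '2020-01-01', '2021-05-05', '2023-11-30'],
    'OperatorName': ['op1', 'op1', 'op2', 'op3'],
    'TotalBaseWaterVolume': [100.0, 100.0, 200.0, 300.0],
})

# Step 2: keep a single copy of each duplicated old disclosure.
old_nodup = old[~old.duplicated(subset=KEY_COLS, keep='last')]

# Step 3: attach an UploadKey to every DisclosureId. validate='m:1' raises
# an error if the de-duplicated old keys are not unique.
key_map = pd.merge(new, old_nodup, on=KEY_COLS, how='outer',
                   validate='m:1', indicator=True)

# Step 6: rows found only in the new archive are the new disclosures,
# so they are left out of the baseline.
baseline = key_map[key_map['_merge'] == 'both']
print(baseline[['DisclosureId', 'UploadKey']])
```

In this toy case `d1` and `d2` both map to `u2`, and `d4` is the only row left out of the baseline.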

In [None]:
import pandas as pd

old_fn = r"C:\MyDocs\OpenFF\src\openFF-integrated\tmp\all_meta\ff_archive_meta_2023-11-25.parquet"
new_fn = r"C:\MyDocs\OpenFF\src\openFF-integrated\tmp\all_meta\ff_archive_meta_2023-12-04.parquet"
# see "C:\MyDocs\OpenFF\src\openFF-integrated\build_meta.ipynb" for how these summary files were created

old_upload_dates_fn = r"C:\MyDocs\integrated\repos\openFF_data_2023_11_25\curation_files\upload_dates.parquet"


In [None]:
# step 1 and 2
newdf = pd.read_parquet(new_fn)
olddf = pd.read_parquet(old_fn)
# remove duplicates except a single copy
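# .copy() so that adding the UploadKey column below does not trigger
# a SettingWithCopyWarning on the filtered frame
olddf = olddf[~(olddf.duplicated(keep='last',
                              subset=['APINumber','JobEndDate',
                                      'JobStartDate','OperatorName',
                                      'TotalBaseWaterVolume']))].copy()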
print(f'Len old (no dupes): {len(olddf)},  Len new: {len(newdf)}')
olddf['UploadKey'] = olddf.pKey
print(olddf.columns)

In [None]:
# step 3
mg = pd.merge(newdf, olddf[['UploadKey','APINumber','JobEndDate','JobStartDate','OperatorName','TotalBaseWaterVolume']],
             on = ['APINumber','JobEndDate','JobStartDate','OperatorName','TotalBaseWaterVolume'],
             how='outer',indicator=True, validate='m:1')
mg._merge.value_counts()

In [None]:
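# trim to just the inner join (this drops the DisclosureId-only rows, step 6)
# drop() returns a new frame, which avoids an in-place change on a slice
mg = mg[mg._merge=='both'].drop('_merge',axis=1)
upk = mg.UploadKey.unique().tolist()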
# get old upload_dates
old_upl = pd.read_parquet(old_upload_dates_fn)
old_upl.columns

In [None]:

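# steps 4 and 5: bring in the old publish dates; the outer join also keeps
# any UploadKeys that are missing from the new baseline (DisclosureId is NaN)
mg = pd.merge(mg,old_upl, on='UploadKey',how='outer',indicator=True,validate='m:1')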
mg._merge.value_counts()

In [None]:
mg = mg[['DisclosureId','UploadKey','date_added','num_records','weekly_report']] 
mg.head(2)

In [None]:
# save to use as upload_dates.parquet in next build (transfer to orig_dir/curation_files)
mg.to_parquet('./sandbox/upload_dates.parquet')

---
# Preparing the dup_rec map for the disclosures before FFV4
FF changed the format of "empty" Supplier, Purpose and TradeName fields, which broke the way that I detected duplicated records.  While I wait (forever?) for a solution from FracFocus to deal with such problems, I create a file here that later builds can use to keep removing the records detected pre-FFV4.

The file will have the `DisclosureId` and the other fields needed to identify the already detected duplicates.

Note that because of disclosure duplicates (which are filtered in the standard set), several `DisclosureId`s can point to the same `UploadKey`; we keep only the first one for each `UploadKey`.
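
A minimal sketch of why the duplicate keys have to go before the merge, using invented toy frames (the names `key_map`, `flagged` and `one_per_key` are hypothetical). With `validate='m:1'`, pandas requires the merge key to be unique on the right-hand side, so the map is first reduced to one `DisclosureId` per `UploadKey`.

```python
import pandas as pd

# Toy key map: d1 and d2 are duplicate disclosures that share UploadKey u2.
key_map = pd.DataFrame({'DisclosureId': ['d1', 'd2', 'd3'],
                        'UploadKey': ['u2', 'u2', 'u3']})

# Toy records already flagged as duplicates in the pre-FFV4 data.
flagged = pd.DataFrame({'UploadKey': ['u2', 'u2', 'u3'],
                        'CASNumber': ['7732-18-5', '7647-14-5', '7732-18-5'],
                        'dup_rec': [True, True, True]})

# Keep the first DisclosureId for each UploadKey so the right side is unique.
one_per_key = key_map[~key_map.UploadKey.duplicated()]

# Every flagged record picks up exactly one DisclosureId.
out = pd.merge(flagged, one_per_key, on='UploadKey',
               how='left', validate='m:1')
print(out)
```

Merging against the full `key_map` instead would fail the `m:1` check, because `u2` appears twice on the right.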

In [None]:
mg = pd.read_parquet('./sandbox/upload_dates.parquet')

fn = r"C:\MyDocs\integrated\repos\openFF_data_2023_11_25\full_df.parquet"
df = pd.read_parquet(fn, columns=['UploadKey','CASNumber','IngredientName','PercentHFJob',
                                  'PercentHighAdditive','MassIngredient','dup_rec'])
df = df[df.dup_rec]
out = pd.merge(df,mg[~mg.UploadKey.duplicated()][['UploadKey','DisclosureId']],on='UploadKey',how='left',validate='m:1')
out.to_parquet('./sandbox/trans_dup_res.parquet')

In [None]:
fn = r"C:\MyDocs\integrated\openFF\build\sandbox\final\full_df.parquet"
df = pd.read_parquet(fn)
out.rename({'dup_rec':'old_dup_rec'},axis=1,inplace=True)
out.drop('UploadKey',axis=1,inplace=True)
mg = pd.merge(df,out[~out.duplicated()],on=['DisclosureId','CASNumber','IngredientName','PercentHFJob',
                         'PercentHighAdditive','MassIngredient'],how='outer',validate='m:1',indicator=True)
mg._merge.value_counts()


In [None]:
# after the outer merge both flags can contain NaN; eq(True) treats NaN as False
mg[mg.dup_rec.eq(True) & mg.old_dup_rec.eq(True)].FFVersion_x.value_counts()

## Other changes to FracFocus data

- Some changes to the CASNumber field: it appears that they are now filtering a little bit more (maybe dropping leading zeros?), so some of the CASNumbers I resolved before are showing up again as unresolved.
- It appears that FF has started to set invalid or out of range lat and lon values to NaN.  This initially broke my reproject code, so I've set them to dummy values (0,0).
- Water source table
- Dropping many unused or poorly used fields
- The FracFocus README now includes `MassIngredient`, suggesting that we can use it as a valid input to our reported masses.
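
As an illustration of the lat/lon item above, here is a rough sketch of filling missing coordinates with the dummy point before reprojecting. The column names `Latitude` and `Longitude`, the toy values, and the `loc_is_dummy` flag are assumptions made for this example; the actual Open-FF code may handle this differently.

```python
import numpy as np
import pandas as pd

# Toy locations (invented); the column names are assumed for this sketch.
locs = pd.DataFrame({'Latitude': [32.1, np.nan, 40.0],
                     'Longitude': [-101.5, -99.0, np.nan]})

# A location is unusable if either coordinate is missing.
bad = locs.Latitude.isna() | locs.Longitude.isna()

# Replace both coordinates with the dummy point (0,0) and keep a
# hypothetical flag so the placeholder rows can be excluded later.
locs.loc[bad, ['Latitude', 'Longitude']] = 0.0
locs['loc_is_dummy'] = bad
print(locs)
```

Keeping a flag next to the dummy values is a design choice for the sketch: it makes it harder to mistake (0,0) for a real location downstream.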