## $DS^2$: Operator Selection

In this notebook, we'll query the `gh-2017` corpus to find the most used data science "operators." An operator, in this context, is either something you call _on a dataframe_ (e.g., `df.drop(...)`) or something you call and _pass a dataframe_ (e.g., `foo.fit_transform(df)`).

There are many other possible non-call operators. For example, one can filter a dataframe like so: `df[df.x == 12]`. We will also attempt to capture a sample of such operators in our analysis, but in this notebook we stick to understanding which call-based operators are widely used.
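To make the two call-based operator kinds concrete, here is a minimal sketch that separates them using only the standard-library `ast` module. It is not how `codebook` answers these queries: the helper name `classify_calls` is hypothetical, and the set of dataframe variable names is passed in by hand rather than derived from data flow.

```python
import ast

def classify_calls(source, df_names):
    """Return (methods called on a dataframe, callables that receive a dataframe)."""
    on_df, with_df = [], []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Kind 1: a method call whose receiver is a known dataframe name, e.g. df.drop(...)
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id in df_names):
            on_df.append(func.attr)
        # Kind 2: a call that receives a known dataframe name as an argument, e.g. foo.fit_transform(df)
        args = list(node.args) + [kw.value for kw in node.keywords]
        if any(isinstance(a, ast.Name) and a.id in df_names for a in args):
            with_df.append(ast.unparse(func))  # ast.unparse needs Python 3.9+
    return on_df, with_df

example = """
df = pd.read_csv('x.csv')
df.drop('a', axis=1)
scaler.fit_transform(df)
pd.merge(df, other, on='k')
"""
on_df, with_df = classify_calls(example, {"df"})
print(on_df)
print(sorted(with_df))
```

A name-based check like this misses aliases and reassignments, which is why the queries below rely on uses of the `pd.read_*` results instead of fixed variable names.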

In [22]:
%load_ext autoreload
%autoreload 2

from codebook.python import *
from codebook.semantics import DSNotebooks as DSN

# Use the gh-2017 dataset. ~760k unique python notebooks
Evaluator.use_ds_test_1k()

# flows_df = pd.read_csv('/data/gh-2017/results/flows-read-to-fit.csv')
# DSN.set_prefilters(set(flows_df.fpath.unique()))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
tmp = execute(
  use_of(DSN.pandas_read()),
  compile=True
)

In [2]:
drops, df1, df2, df3 = DSN.drops()

  + Had only 38896 allowable files (pre-filter files)
  + File select time: 13.5941s
  + Found 17226 matching files
  + Query already compiled (cached) `/tmp/queries/1695105906fd5ffdcaeb2de47f4ae07987d0b12d4d852ff044af54071377fc07.dl`
  + Query time: 15.4015s
  + Collation time: 0.2035s
Total time: 29.2116s
  + Had only 16571 allowable files (pre-filter files)
  + File select time: 0.5080s
  + Found 16571 matching files
  + Query already compiled (cached) `/tmp/queries/aa10ea9669be3d8ff9e6e758129fb5eaa06d173c9d2854567fe8e463b8c0f034.dl`
  + Query time: 16.6097s
  + Collation time: 0.4176s
Total time: 17.5447s
  + Had only 16571 allowable files (pre-filter files)
  + File select time: 0.5003s
  + Found 16571 matching files
  + Query already compiled (cached) `/tmp/queries/15a07acf4919d48738ae053a00584c2c0914d4d2088f558a6ed5b97d770ac47c.dl`
  + Query time: 16.4382s
  + Collation time: 0.0806s
Total time: 17.0283s
  + Had only 16571 allowable files (pre-filter files)
  + File select time:

SyntaxError: unexpected character after line continuation character (<string>, line 1)

In [42]:
tmp = df1.set_index('gid_use').join(uses1.set_index('gid_use')[["source_text_cols"]])

In [41]:
tmp[tmp.source_text_cols != tmp.source_text_use]

gid_use,fpath,source_text_use,start_line_use,start_col_use,end_line_use,end_col_use,source_text_call,start_line_call,start_col_call,end_line_call,end_col_call,gid_call,source_text_cols
-9194221465426937159,/cb-target/python/98/56/74cfed73e94c675d8b51a6...,lab,102,41,102,44,"test.drop(lab,1)",102,31,102,47,1436502845706836184,'churndep'
-9163347779137382803,/cb-target/python/05/c0/482c0c73ed954256addb5d...,lab,165,48,165,51,"data_train.drop(lab,1)",165,32,165,54,-5787063165355767296,'pitch_type'
-9155254853313462531,/cb-target/python/8a/bf/dfad71c1ca8f529809cf5e...,lab,28,19,28,22,"train.drop(lab,1)",28,8,28,25,-947904545826073260,'churndep'
-9113105442898604565,/cb-target/python/f9/fd/a9f41294d5c10e58606e5b...,feature,53,21,53,28,"data.drop(feature, axis=1)",53,11,53,37,1685407333614040123,"""Fresh"""
-9081630177581378688,/cb-target/python/6b/d4/f1101e11868ed072ac0608...,target,189,37,189,43,"validation_data.drop(target, axis = 1)",189,16,189,54,-2773968944138403843,'safe_loans'
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9093606210494608052,/cb-target/python/a6/bd/df6387710a08b13112a17c...,dep_var_name,537,26,537,38,"valid_data.drop(dep_var_name, axis=1)",537,10,537,47,1535846630715324820,'Response'
9177857710285211791,/cb-target/python/52/da/27a44d1880cce3b5403ee4...,target,252,66,252,72,"validation_data.drop(target, axis=1)",252,45,252,81,7016841324524147076,'safe_loans'
9182387749492712282,/cb-target/python/5b/f4/c0395e59c6d0a75202d1bb...,response,98,21,98,29,"data.drop(response, axis = 1)",98,11,98,40,9197782229524717749,'Grocery'
9182710990660449973,/cb-target/python/cc/76/d5e2454a30b1900e749a77...,lab,90,33,90,36,"train_df.drop(lab,1)",90,19,90,39,-6615913795226506570,'churndep'
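The rows above are the cases where the text at the use site (e.g., `lab`) differs from the string it resolves to (e.g., `'churndep'`). As a rough illustration of that resolution step, the sketch below follows simple top-level assignments with the standard-library `ast` module. It is an assumption-laden stand-in, not the `use_of` machinery itself: the helper `resolve_drop_columns` is hypothetical and only handles straight-line, module-level code.

```python
import ast

def resolve_drop_columns(source):
    """Pair each `.drop(<name>, ...)` call with the string literal last assigned to <name>."""
    constants = {}  # variable name -> string literal assigned at module level
    resolved = []
    for stmt in ast.parse(source).body:
        # Track simple assignments such as `lab = 'churndep'`
        if (isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            constants[stmt.targets[0].id] = stmt.value.value
        # Find drop calls whose first positional argument is a tracked name
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == 'drop'
                    and node.args
                    and isinstance(node.args[0], ast.Name)
                    and node.args[0].id in constants):
                name = node.args[0].id
                resolved.append((ast.unparse(node), name, constants[name]))
    return resolved

example = """
lab = 'churndep'
X = test.drop(lab, 1)
target = 'safe_loans'
y = validation_data.drop(target, axis=1)
"""
rows = resolve_drop_columns(example)
for call_text, use_text, col in rows:
    print(call_text, '|', use_text, '->', repr(col))
```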


In [2]:
# Let's save these so we can avoid re-querying later...
reads.to_csv('/data/gh-2017/results/pandas-reads-2.csv', encoding='utf-8', index=False)

In [3]:
# Let's look at any call where the _target_ of said call is a use of one of the `pd.read_*` calls
# This should capture things like df.drop(...) or df.head(...)
calls_on_dataframes = execute(
  call() % select('name') % 'call'
  |where| call_target()
    |is_| anything(from_set(DSN.use_of_pandas_read(), 'gid')),
  compile=True
)

  + Had only 181799 allowable files (pre-filter files)
  + File select time: 0.5585s
  + Found 181799 matching files
  + Compile time: 12.8148s
  + Query time: 269.7335s
  + Collation time: 6.6299s
Total time: 291.2666s
  + Had only 181799 allowable files (pre-filter files)
  + File select time: 0.5084s
  + Found 181799 matching files
  + Compile time: 13.2327s
  + Query time: 128.8160s
  + Collation time: 1.4943s
Total time: 152.8976s


In [4]:
# Let's save these so we can avoid re-querying later...
calls_on_dataframes.to_csv('/data/gh-2017/results/calls-on-dataframes-2.csv', encoding='utf-8', index=False)

In [5]:
# Let's look at any call where any _argument_ of said call is a use of one of the `pd.read_*` calls
# This should capture things like pd.merge(...df...) and x.fit_transform(df)
calls_with_dataframes_as_args = execute(
  call() % select('name') % 'call'
  |where| any_arg()
    |is_| anything(from_set(DSN.use_of_pandas_read(), 'gid')),
  compile=True
)

  + Had only 181799 allowable files (pre-filter files)
  + File select time: 0.5004s
  + Found 181799 matching files
  + Compile time: 12.2598s
  + Query time: 123.7710s
  + Collation time: 0.5320s
Total time: 145.8114s


In [6]:
# Let's save these so we can avoid re-querying later...
calls_with_dataframes_as_args.to_csv('/data/gh-2017/results/calls-with-dataframes-as-args-2.csv', encoding='utf-8', index=False)
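Since each of these queries takes minutes, the save-then-reload pattern can be wrapped in a small helper. This is only a sketch: `load_or_compute` and `fake_query` are hypothetical names, and `fake_query` stands in for an `execute(...)` call so the example runs without the corpus.

```python
import os
import tempfile
import pandas as pd

def load_or_compute(path, compute):
    """Read a cached CSV if it exists; otherwise run `compute`, save its result, and return it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = compute()
    df.to_csv(path, encoding='utf-8', index=False)
    return df

calls = {"n": 0}

def fake_query():
    # Stand-in for an expensive query; counts how often it actually runs
    calls["n"] += 1
    return pd.DataFrame({
        "fpath": ["a.py", "b.py"],
        "source_text_call": ["df.head()", "df.describe()"],
    })

cache_path = os.path.join(tempfile.mkdtemp(), "calls-on-dataframes.csv")
first = load_or_compute(cache_path, fake_query)   # runs the query and writes the CSV
second = load_or_compute(cache_path, fake_query)  # served from the CSV on disk
print(calls["n"])
```

Note that a CSV round trip does not preserve every dtype, so a reloaded frame may need columns re-cast before it is joined against freshly queried results.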

In [2]:
flows_df = DSN.flows_reads_to_fits()
flows_df.to_csv('/data/gh-2017/results/flows-read-to-fit-2.csv', encoding='utf-8', index=False)

  + File select time: 15.5784s
  + Found 273870 matching files
  + Query already compiled (cached) `/tmp/queries/d475318973531224689b830105c1300d9621cbf129cc341fb5c98c12a55daac6.dl`
  + Query time: 171.6464s
  + Collation time: 0.8490s
Total time: 188.2752s
  + Had only 273280 allowable files (pre-filter files)
  + File select time: 0.5973s
  + Found 273280 matching files
  + Query already compiled (cached) `/tmp/queries/58aff5d36c21bb2057ad1c14a474da647db60818758e0eca168663099d2fe35e.dl`
  + Query time: 379.9996s
  + Collation time: 2.9589s
Total time: 385.1650s
  + Had only 244216 allowable files (pre-filter files)
  + File select time: 15.3107s
  + Found 162790 matching files
  + Query already compiled (cached) `/tmp/queries/36e091878426f6c2a0bd57bb41091dfd38efdf97f5b6babdc0c2b7b5035915dd.dl`
  + Query time: 128.1906s
  + Collation time: 1.3650s
Total time: 149.4832s
  + Had only 244216 allowable files (pre-filter files)
  + File select time: 14.7501s
  + Found 9664 matching files
 

In [10]:
# df1: drop(...) whose first argument resolves to a single string (one column name)
df1 = execute(
  call(with_name('drop')) % 'call'
  |where| the_first_arg()
    |is_| use_of(string() % 'cols'),
  compile=True
)
# df2: drop(...) whose first argument resolves to a list of strings (several column names)
df2 = execute(
  call(with_name('drop')) % 'call'
  |where| the_first_arg()
    |is_| use_of(
      list_(where_every_child_has_type('string')) % 'cols'
    ),
  compile=True
)
# df3: drop(...) whose first argument is `<expr>.columns[...]` indexed by a list of integers
df3 = execute(
  call(with_name('drop')) % 'call'
  |where| the_first_arg()
    |isa| subscript()
    |where| the_value_is(
      attribute()
      |where| the_attribute()
        |isa| identifier(with_text('columns'))
    )
    |and_w| the_subscript_is(use_of(
      list_(where_every_child_has_type('integer')) % 'col_ids'
    )),
  compile=True
)


  + File select time: 13.4338s
  + Found 69032 matching files
  + Profile time: 0.7431s
  + Compile time: 15.4840s
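For reference, the three `drop` queries above target three different shapes of first argument. The sketch below classifies a single call expression into those shapes with the standard-library `ast` module. It is an illustration only: `drop_arg_shape` is a hypothetical helper, and unlike `use_of(...)` it looks at literal arguments and does not follow variables back to their definitions.

```python
import ast

def drop_arg_shape(call_source):
    """Classify the first positional argument of one `.drop(...)` call expression."""
    call = ast.parse(call_source, mode='eval').body
    arg = call.args[0]
    # Shape 1: a single string literal, e.g. df.drop('a', axis=1)
    if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
        return 'string'
    # Shape 2: a non-empty list of string literals, e.g. df.drop(['a', 'b'], axis=1)
    if (isinstance(arg, ast.List) and arg.elts and all(
            isinstance(e, ast.Constant) and isinstance(e.value, str)
            for e in arg.elts)):
        return 'list of strings'
    # Shape 3: a subscript on a `.columns` attribute, e.g. df.drop(df.columns[[0, 1]], axis=1)
    if (isinstance(arg, ast.Subscript)
            and isinstance(arg.value, ast.Attribute)
            and arg.value.attr == 'columns'):
        return 'columns subscript'
    return 'other'

shapes = [drop_arg_shape(src) for src in (
    "df.drop('a', axis=1)",
    "df.drop(['a', 'b'], axis=1)",
    "df.drop(df.columns[[0, 1]], axis=1)",
    "df.drop(cols, axis=1)",
)]
print(shapes)
```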


In [4]:
tmp = execute(
  call(with_name('astype')) % 'call'
  |where| the_first_arg()
    |is_| anything() % 'arg',
  compile=True
)

tmp

  + File select time: 13.7839s
  + Found 79548 matching files
  + Profile time: 0.4675s
  + Compile time: 14.2812s
  + Query time: 85.8940s
  + Collation time: 0.9760s
Total time: 115.4615s


,fpath,source_text_arg,start_line_arg,start_col_arg,end_line_arg,end_col_arg,gid_arg,source_text_call,start_line_call,start_col_call,end_line_call,end_col_call,gid_call
0,/cb-target/python/06/31/1174180126dfee4e156a82...,str,53,87,53,90,4191463142264296285,pd.Series(train['question1'].tolist() + train[...,53,11,53,91,5096416365503182228
1,/cb-target/python/06/31/1174180126dfee4e156a82...,str,54,84,54,87,-5238394647061319845,pd.Series(test['question1'].tolist() + test['q...,54,10,54,88,4205226611058234322
2,/cb-target/python/c7/c9/1657f1ad6c8bb714050519...,np.uint8,212,29,212,37,-1849017383981044134,X_batch.astype(np.uint8),212,14,212,38,550118149961555870
3,/cb-target/python/c7/c9/1657f1ad6c8bb714050519...,np.uint8,212,55,212,63,8992599103791713626,y_batch.astype(np.uint8),212,40,212,64,5385273982592926609
4,/cb-target/python/9d/ab/7e881730225f9a3205bd04...,float,66,56,66,61,-6744880828036548225,affair_yrs_married.sum(1).astype(float),66,23,66,62,-4486830595719308224
...,...,...,...,...,...,...,...,...,...,...,...,...,...
233284,/cb-target/python/a3/f6/4de1258a5dbb38fe010ef0...,int,260,75,260,78,5629342229892331681,"dataset['Sex'].map( {'female': 1, 'male': 0} )...",260,21,260,79,4577962677949643897
233285,/cb-target/python/a3/f6/4de1258a5dbb38fe010ef0...,int,303,43,303,46,-7354742359776549189,dataset['Age'].astype(int),303,21,303,47,-551126151521847281
233286,/cb-target/python/a3/f6/4de1258a5dbb38fe010ef0...,int,433,85,433,88,-7470714565591863121,"dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q':...",433,26,433,89,-4079281738720292316
233287,/cb-target/python/a3/f6/4de1258a5dbb38fe010ef0...,int,465,45,465,48,-179869217207470555,dataset['Fare'].astype(int),465,22,465,49,1441489520888547935


In [16]:
tmp = DSN.join()

INFO: Pandarallel will run on 20 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
  + File select time: 14.1967s
  + Found 99172 matching files
  + Query already compiled (cached) `/tmp/queries/0f23c97763ac4775abed1c1616f8e327647ec57d09863e912edef9aa006c077d.dl`
  + Query time: 89.9777s
  + Collation time: 1.0551s
Total time: 105.2807s
  + Had only 97197 allowable files (pre-filter files)
  + File select time: 13.5265s
  + Found 10576 matching files
  + Profile time: 0.8708s
  + Compile time: 13.7252s
  + Query time: 52.7227s
  + Collation time: 0.0639s
Total time: 81.9157s
  + Had only 97197 allowable files (pre-filter files)
  + File select time: 13.5704s
  + Found 7696 matching files
  + Profile time: 0.8195s
  + Compile time: 13.8196s
  + Query time: 24.9833s
  + Collation time: 0.0271s
Total time: 54.2196s


In [17]:
tmp.pretty.value_counts().head(30)

JoinLeft[sort=False]           289905
JoinInner[sort=False]            1482
JoinOuter[sort=False]            1407
JoinRight[sort=False]             285
JoinLeft_outer[sort=False]         54
JoinOuter[sort=True]               12
JoinLeft[sort=True]                 9
JoinInner[sort=True]                8
JoinFull[sort=False]                5
JoinLeftsemi[sort=False]            3
JoinRight_outer[sort=False]         1
JoinLeft_semi[sort=False]           1
JoinFull_outer[sort=False]          1
Name: pretty, dtype: int64
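Several of these labels look like spelling variants of the same join kind. The sketch below merges them under the assumption that `Left_outer`, `Right_outer`, `Full`, `Full_outer`, and `Leftsemi` are aliases in the style of Spark's join types; that mapping is our guess, not something the `pretty` column states. The counts are copied from the output above.

```python
from collections import Counter

# Counts copied from the value_counts() output above
raw_counts = {
    'JoinLeft[sort=False]': 289905,
    'JoinInner[sort=False]': 1482,
    'JoinOuter[sort=False]': 1407,
    'JoinRight[sort=False]': 285,
    'JoinLeft_outer[sort=False]': 54,
    'JoinOuter[sort=True]': 12,
    'JoinLeft[sort=True]': 9,
    'JoinInner[sort=True]': 8,
    'JoinFull[sort=False]': 5,
    'JoinLeftsemi[sort=False]': 3,
    'JoinRight_outer[sort=False]': 1,
    'JoinLeft_semi[sort=False]': 1,
    'JoinFull_outer[sort=False]': 1,
}

# Assumed aliases (hypothetical mapping, modeled on Spark-style join type names)
ALIASES = {
    'Left_outer': 'Left',
    'Right_outer': 'Right',
    'Full': 'Outer',
    'Full_outer': 'Outer',
    'Leftsemi': 'Left_semi',
}

def normalize(label):
    """Rewrite 'Join<Kind>[...]' so that aliased kinds share one canonical label."""
    kind, _, rest = label.partition('[')
    kind = kind[len('Join'):]
    kind = ALIASES.get(kind, kind)
    return f'Join{kind}[{rest}'

merged = Counter()
for label, n in raw_counts.items():
    merged[normalize(label)] += n

for label, n in merged.most_common():
    print(f'{label:<28}{n:>8}')
```

Under this mapping the thirteen labels collapse to eight, and the left join remains dominant by more than two orders of magnitude.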

In [19]:
visualize(tmp[tmp.pretty.str.contains('Left_outer')], 'call')