hansalemaos/a_pandas_ex_intersection_difference

Computes the intersection/symmetric difference of n DataFrames/Series

Installation

pip install a-pandas-ex-intersection-difference

Usage

from a_pandas_ex_intersection_difference import pd_add_set
pd_add_set()
import pandas as pd
The code above adds some methods to pandas. You can use pandas as you did before, but you will have a few more methods:
  • pandas.DataFrame.ds_set_intersections / pandas.Series.ds_set_intersections
  • pandas.DataFrame.ds_set_symmetric_difference / pandas.Series.ds_set_symmetric_difference
  • pandas.DataFrame.ds_set_union / pandas.Series.ds_set_union
  • pandas.DataFrame.ds_value_counts_to_column / pandas.Series.ds_value_counts_to_column
pandas.DataFrame.ds_set_intersections / pandas.Series.ds_set_intersections
    #Computes the intersection of n DataFrames/Series

    #Example
    df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")

    #Let's create some DataFrames with random data from df

    df1 = df.sample(len(df) - len(df)//2).copy()
    df2 = df.sample(len(df) - len(df)//2).copy()
    df3 = df.sample(len(df) - len(df)//2).copy()
    df4 = df.sample(len(df) - len(df)//2).copy()
    df5 = df.sample(len(df) - len(df)//2).copy()

    
    df1.ds_set_intersections(df2) #Comparing 2 DataFrames
    Out[14]:
         Parch  PassengerId      Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0        1          802   26.2500         1  ...      1        S  female   NaN
    1        0          506  108.9000         0  ...      1        C    male   C65
    2        0          386   73.5000         0  ...      0        S    male   NaN
    3        0          621   14.4542         0  ...      1        C    male   NaN
    4        1          273   19.5000         1  ...      0        S  female   NaN
    ..     ...          ...       ...       ...  ...    ...      ...     ...   ...
    439      0          240   12.2750         0  ...      0        S    male   NaN
    440      0          235   10.5000         0  ...      0        S    male   NaN
    441      1          269  153.4625         1  ...      0        S  female  C125
    442      0          394  113.2750         1  ...      1        C  female   D36
    443      0          400   12.6500         1  ...      0        S  female   NaN
    [444 rows x 12 columns]     
    
    
    df1.ds_set_intersections(df2,df3)  #Comparing 3 DataFrames   
    Out[15]:
         Parch  PassengerId      Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0        0          506  108.9000         0  ...      1        C    male   C65
    1        1          480   12.2875         1  ...      0        S  female   NaN
    2        1          581   30.0000         1  ...      1        S  female   NaN
    3        1          447   19.5000         1  ...      0        S  female   NaN
    4        0           16   16.0000         1  ...      0        S  female   NaN
    ..     ...          ...       ...       ...  ...    ...      ...     ...   ...
    340      2          154   14.5000         0  ...      0        S    male   NaN
    341      0          668    7.7750         0  ...      0        S    male   NaN
    342      0          702   26.2875         1  ...      0        S    male   E24
    343      0          610  153.4625         1  ...      0        S  female  C125
    344      0          450   30.5000         1  ...      0        S    male  C104
    [345 rows x 12 columns]    
    
    
    df1.ds_set_intersections(df2,df3, df4)  #Comparing 4 DataFrames 
    Out[16]:
         Parch  PassengerId      Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0        0          506  108.9000         0  ...      1        C    male   C65
    1        1          581   30.0000         1  ...      1        S  female   NaN
    2        0          283    9.5000         0  ...      0        S    male   NaN
    3        0          488   29.7000         0  ...      0        C    male   B37
    4        0          610  153.4625         1  ...      0        S  female  C125
    ..     ...          ...       ...       ...  ...    ...      ...     ...   ...
    227      0           23    8.0292         1  ...      0        Q  female   NaN
    228      1          619   39.0000         1  ...      2        S  female    F4
    229      2          473   27.7500         1  ...      1        S  female   NaN
    230      0          253   26.5500         0  ...      0        S    male   C87
    231      0          618   16.1000         0  ...      1        S  female   NaN
    [232 rows x 12 columns]   
    
    
    df1.ds_set_intersections(df2,df3, df4, df5)  #Comparing 5 DataFrames
    Out[17]:
         Parch  PassengerId      Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0        0          506  108.9000         0  ...      1        C    male   C65
    1        1          581   30.0000         1  ...      1        S  female   NaN
    2        1           17   29.1250         0  ...      4        Q    male   NaN
    3        2           59   27.7500         1  ...      1        S  female   NaN
    4        0          463   38.5000         0  ...      0        S    male   E63
    ..     ...          ...       ...       ...  ...    ...      ...     ...   ...
    140      2          166   20.5250         1  ...      0        S    male   NaN
    141      0          705    7.8542         0  ...      1        S    male   NaN
    142      1           51   39.6875         0  ...      4        S    male   NaN
    143      0          833    7.2292         0  ...      0        C    male   NaN
    144      2          154   14.5000         0  ...      0        S    male   NaN
    [145 rows x 12 columns]
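
For comparison, a row-wise intersection can be sketched in plain pandas with inner merges. This is only an illustration of the idea, not the implementation behind ds_set_intersections; the helper name intersect_rows and the toy DataFrames are hypothetical.

```python
from functools import reduce

import pandas as pd


def intersect_rows(*frames: pd.DataFrame) -> pd.DataFrame:
    """Return the rows found in every frame (hypothetical helper, plain pandas)."""
    # Compare only on the columns that all frames share, in the first frame's order.
    common = [c for c in frames[0].columns if all(c in f.columns for f in frames[1:])]
    # Drop duplicate rows first so the inner merges cannot multiply rows.
    deduped = [f[common].drop_duplicates() for f in frames]
    # An inner merge on all common columns keeps only rows present on both sides.
    merged = reduce(lambda left, right: left.merge(right, on=common, how="inner"), deduped)
    return merged.reset_index(drop=True)


a = pd.DataFrame({"id": [1, 2, 3, 4], "sex": ["m", "f", "f", "m"]})
b = pd.DataFrame({"id": [2, 3, 4, 5], "sex": ["f", "f", "m", "m"]})
c = pd.DataFrame({"id": [3, 4, 6], "sex": ["f", "x", "m"]})

result = intersect_rows(a, b, c)
print(result)
```

Only the row with id 3 appears unchanged in all three toy frames, so it is the only row that survives both merges.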
pandas.DataFrame.ds_set_symmetric_difference / pandas.Series.ds_set_symmetric_difference
    #Computes the symmetric difference of n DataFrames/Series

    #Example
    df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")

    #Let's create some DataFrames with random data from df

    df1 = df.sample(len(df) - len(df)//2).copy()
    df2 = df.sample(len(df) - len(df)//2).copy()
    df3 = df.sample(len(df) - len(df)//2).copy()
    df4 = df.sample(len(df) - len(df)//2).copy()
    df5 = df.sample(len(df) - len(df)//2).copy()

    df1.ds_set_symmetric_difference(df2) #Comparing 2 DataFrames
    Out[18]:
         Parch  PassengerId      Fare  ...  Embarked     Sex        Cabin
    0        0          567    7.8958  ...         S    male          NaN
    1        0           46    8.0500  ...         S    male          NaN
    2        2          342  263.0000  ...         S  female  C23 C25 C27
    3        0          845    8.6625  ...         S    male          NaN
    4        0            1    7.2500  ...         S    male          NaN
    ..     ...          ...       ...  ...       ...     ...          ...
    219      0          865   13.0000  ...         S    male          NaN
    220      5          639   39.6875  ...         S  female          NaN
    221      0           30    7.8958  ...         S    male          NaN
    222      0          332   28.5000  ...         S    male         C124
    223      0          884   10.5000  ...         S    male          NaN
    [448 rows x 12 columns]
    
    
    df1.ds_set_symmetric_difference(df2,df3)  #Comparing 3 DataFrames
    Out[19]:
         Parch  PassengerId     Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0        0          567   7.8958         0  ...      0        S    male   NaN
    1        0           46   8.0500         0  ...      0        S    male   NaN
    2        0          845   8.6625         0  ...      0        S    male   NaN
    3        0          142   7.7500         1  ...      0        S  female   NaN
    4        0          579  14.4583         0  ...      1        C  female   NaN
    ..     ...          ...      ...       ...  ...    ...      ...     ...   ...
    106      0          430   8.0500         1  ...      0        S    male   E10
    107      1          363  14.4542         0  ...      0        C  female   NaN
    108      1          531  26.0000         1  ...      1        S  female   NaN
    109      0          748  13.0000         1  ...      0        S  female   NaN
    110      0          876   7.2250         1  ...      0        C  female   NaN
    [339 rows x 12 columns]
    
    
    df1.ds_set_symmetric_difference(df2,df3,df4)  #Comparing 4 DataFrames
    Out[20]:
        Parch  PassengerId      Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0       0          567    7.8958         0  ...      0        S    male   NaN
    1       0           46    8.0500         0  ...      0        S    male   NaN
    2       0          142    7.7500         1  ...      0        S  female   NaN
    3       0          579   14.4583         0  ...      1        C  female   NaN
    4       0          365   15.5000         0  ...      1        Q    male   NaN
    ..    ...          ...       ...       ...  ...    ...      ...     ...   ...
    39      2          551  110.8833         1  ...      0        C    male   C70
    40      0           19   18.0000         0  ...      1        S  female   NaN
    41      0          615    8.0500         0  ...      0        S    male   NaN
    42      0          204    7.2250         0  ...      0        C    male   NaN
    43      1          375   21.0750         0  ...      3        S  female   NaN
    [204 rows x 12 columns]
    df1.ds_set_symmetric_difference(df2,df3,df4,df5)  #Comparing 5 DataFrames
    Out[21]:
        Parch  PassengerId     Fare  Survived  ...  SibSp Embarked     Sex Cabin
    0       0          567   7.8958         0  ...      0        S    male   NaN
    1       0          579  14.4583         0  ...      1        C  female   NaN
    2       0          365  15.5000         0  ...      1        Q    male   NaN
    3       0          644  56.4958         1  ...      0        S    male   NaN
    4       0          708  26.2875         1  ...      0        S    male   E24
    ..    ...          ...      ...       ...  ...    ...      ...     ...   ...
    25      0          343  13.0000         0  ...      0        S    male   NaN
    26      0          656  73.5000         0  ...      2        S    male   NaN
    27      0          407   7.7500         0  ...      0        S    male   NaN
    28      0          301   7.7500         1  ...      0        Q  female   NaN
    29      0          819   6.4500         0  ...      0        S    male   NaN
    [125 rows x 12 columns]

    
    
        Parameters
            args: Union[pd.Series, pd.DataFrame]
                Any number of DataFrames or Series
            accept_df_with_different_columns: bool=True
                Suppose one DataFrame has the columns [Parch, PassengerId, Fare, Survived, SibSp, Embarked, Sex, Cabin]
                and you want to compare it to one with the columns [Flight, Fare, Survived, SibSp, Embarked, Sex, Cabin].
                The comparison only works if accept_df_with_different_columns=True.
                In that case, only the columns that are present in all DataFrames are compared.

        Returns
            pd.DataFrame
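
For two DataFrames, the symmetric difference (rows that occur in exactly one of them) can be sketched in plain pandas with an outer merge and the indicator column. This is a minimal sketch, not the code behind ds_set_symmetric_difference, and it does not claim to reproduce how the method treats more than two inputs; the helper name symmetric_difference_rows and the toy data are hypothetical.

```python
import pandas as pd


def symmetric_difference_rows(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Rows that occur in exactly one of the two frames (hypothetical helper)."""
    # Compare only on the columns both frames share.
    common = [c for c in left.columns if c in right.columns]
    merged = left[common].drop_duplicates().merge(
        right[common].drop_duplicates(),
        on=common,
        how="outer",
        indicator=True,  # adds a "_merge" column: left_only / right_only / both
    )
    # Keep the rows that were found on one side only.
    only_one = merged[merged["_merge"] != "both"]
    return only_one.drop(columns="_merge").reset_index(drop=True)


a = pd.DataFrame({"id": [1, 2, 3, 4], "sex": ["m", "f", "f", "m"]})
b = pd.DataFrame({"id": [2, 3, 4, 5], "sex": ["f", "f", "m", "m"]})

diff = symmetric_difference_rows(a, b)
print(diff)
```

In the toy data, the rows with id 1 and id 5 each occur in only one frame, so they make up the result.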
pandas.DataFrame.ds_set_union / pandas.Series.ds_set_union
    df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")

    #Let's create some DataFrames with random data from df

    df1 = df.sample(len(df) - len(df)//2).copy()
    df2 = df.sample(len(df) - len(df)//2).copy()
    df3 = df.sample(len(df) - len(df)//2).copy()
    df4 = df.sample(len(df) - len(df)//2).copy()
    df5 = df.sample(len(df) - len(df)//2).copy()


    df1[['PassengerId','Survived','Name']].ds_set_union(df2[['Pclass','Cabin','Name']])
    Out[17]:
                                                      Name
    0                                Carbines, Mr. William
    1                            Sundman, Mr. Johan Julian
    2                                     Dimic, Mr. Jovan
    3                          Harder, Mr. George Achilles
    4                                 Rice, Master. Eugene
    ..                                                 ...
    887                       Carlsson, Mr. August Sigfrid
    888                       Hoyt, Mr. Frederick Maxfield
    889                      Somerton, Mr. Francis William
    890                     Francatelli, Miss. Laura Mabel
    891  Thayer, Mrs. John Borland (Marian Longstreth M...


    #If, for whatever reason, you don't want to use pd.concat(), you can use this method instead.
    #Prefer pd.concat() whenever you can use it.

        Parameters
            args: Union[pd.Series, pd.DataFrame]
                Any number of DataFrames or Series
            accept_df_with_different_columns: bool=True
                Suppose one DataFrame has the columns [Parch, PassengerId, Fare, Survived, SibSp, Embarked, Sex, Cabin]
                and you want to compare it to one with the columns [Flight, Fare, Survived, SibSp, Embarked, Sex, Cabin].
                The comparison only works if accept_df_with_different_columns=True.
                In that case, only the columns that are present in all DataFrames are compared.

        Returns
            pd.DataFrame
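
Since pd.concat is the recommended route, here is a minimal sketch of stacking DataFrames on their shared columns with it. The helper name concat_common_columns and the toy data are hypothetical. Whether ds_set_union removes duplicate rows is not stated here, so the sketch shows both variants.

```python
import pandas as pd


def concat_common_columns(*frames: pd.DataFrame) -> pd.DataFrame:
    """Stack the frames, keeping only columns present in all of them (hypothetical helper)."""
    common = [c for c in frames[0].columns if all(c in f.columns for f in frames[1:])]
    return pd.concat([f[common] for f in frames], ignore_index=True)


left = pd.DataFrame({"PassengerId": [1, 2], "Name": ["Ann", "Bob"]})
right = pd.DataFrame({"Pclass": [3, 1], "Name": ["Bob", "Cid"]})

# Only "Name" is shared, so the result has a single column.
stacked = concat_common_columns(left, right)

# Optional: drop repeated rows to get a set-style union.
unique = stacked.drop_duplicates(ignore_index=True)
print(stacked)
print(unique)
```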
pandas.DataFrame.ds_value_counts_to_column / pandas.Series.ds_value_counts_to_column
    df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")

    df2 #One of the sampled DataFrames from the examples above
         PassengerId  Survived  Pclass  ...      Fare Cabin  Embarked
    504          505         1       1  ...   86.5000   B79         S
    781          782         1       1  ...   57.0000   B20         S
    855          856         1       3  ...    9.3500   NaN         S
    552          553         0       3  ...    7.8292   NaN         Q
    777          778         1       3  ...   12.4750   NaN         S
    ..           ...       ...     ...  ...       ...   ...       ...
    756          757         0       3  ...    7.7958   NaN         S
    224          225         1       1  ...   90.0000   C93         S
    488          489         0       3  ...    8.0500   NaN         S
    309          310         1       1  ...   56.9292   E36         C
    581          582         1       1  ...  110.8833   C68         C
    [446 rows x 12 columns]

    df2.Sex.ds_value_counts_to_column()
    Out[22]:
    0      152
    1      152
    2      152
    3      294
    4      152
          ...
    441    294
    442    294
    443    294
    444    152
    445    152
    Name: 0, Length: 446, dtype: int64

    This method can also be useful when you are comparing DataFrames: it counts how often each value occurs in a Series
    and returns a DataFrame that you can merge with your original DataFrame.
        Parameters
            df: pd.Series
        Returns
            pd.DataFrame
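
The per-row counts shown above can be approximated in plain pandas by mapping each value to its frequency. This is a sketch of the idea only, not the library's implementation; the toy Series and the column name Sex_count are hypothetical.

```python
import pandas as pd

sex = pd.Series(["male", "female", "male", "male", "female"], name="Sex")

# value_counts() gives one count per distinct value;
# map() looks up that count for every row, keeping the original index.
# Note: value_counts() skips NaN by default, so NaN rows would map to NaN.
counts_per_row = sex.map(sex.value_counts())

# The aligned counts can be attached to a DataFrame as a new column.
frame = sex.to_frame().assign(Sex_count=counts_per_row)
print(frame)
```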