# Comparing values in single columns between Pandas dataframes

One of your DataFrames has a column of values. A second DataFrame has a column of similar values. How many values do these two columns have in common?

Depending on your goal (merging, de-duplicating, identifying index positions of non-similar values), multiple solutions exist. h/t [Chris Albon](https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/) for data.
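For the simplest version of the question (how many values the two columns share), a set intersection is often enough. The snippet below is a minimal sketch, not the only way to do it; `left` and `right` are hypothetical stand-ins for the two columns, filled with the same employer values used later in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two columns being compared.
left = pd.Series(['Facebook', 'Google', 'Amazon', 'Apple', np.nan], name='employer')
right = pd.Series([np.nan, 'Kensho', 'Robinhood', 'LendUp', 'Facebook'], name='employer')

# Drop nulls first so a missing value on both sides is not counted as a shared value.
common = set(left.dropna()) & set(right.dropna())

print(common)       # the shared values
print(len(common))  # how many there are
```

Dropping nulls before building the sets is a design choice: whether two missing values count as "the same" depends on your data.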

## Create two DataFrames

Each DataFrame has one null `employer` value, and one employer (`Facebook`) appears in both.

In [30]:
import pandas as pd
import numpy as np

In [31]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
        'employer': ['Facebook', 'Google', 'Amazon', 'Apple', np.nan]}

df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name', 'employer'])
df_a

| | subject_id | first_name | last_name | employer |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | Facebook |
| 1 | 2 | Amy | Ackerman | Google |
| 2 | 3 | Allen | Ali | Amazon |
| 3 | 4 | Alice | Aoni | Apple |
| 4 | 5 | Ayoung | Atiches | NaN |


In [103]:
raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'],
        'employer': [np.nan, 'Kensho', 'Robinhood', 'LendUp', 'Facebook']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name', 'employer'])
df_b

| | subject_id | first_name | last_name | employer |
|---|---|---|---|---|
| 0 | 4 | Billy | Bonder | NaN |
| 1 | 5 | Brian | Black | Kensho |
| 2 | 6 | Bran | Balwner | Robinhood |
| 3 | 7 | Bryce | Brice | LendUp |
| 4 | 8 | Betty | Btisan | Facebook |


In [77]:
df_a.employer.sort_values().values == df_b.employer.sort_values().values

array([False, False, False, False,  True], dtype=bool)

In [96]:
df_b.columns

Index([u'subject_id', u'first_name', u'last_name', u'employer'], dtype='object')

In [106]:
df_a.merge(df_b, on='employer')

| | subject_id_x | first_name_x | last_name_x | employer | subject_id_y | first_name_y | last_name_y |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | Facebook | 8 | Betty | Btisan |
| 1 | 5 | Ayoung | Atiches | NaN | 4 | Billy | Bonder |
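Note that the merge pairs Ayoung with Billy only because both have a null employer: pandas matches null keys against each other. If that is not what you want, one option is to drop the null keys before merging. This is a minimal sketch on cut-down, hypothetical frames `a` and `b`; the `_a`/`_b` suffixes are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down frames: one shared employer and one null on each side.
a = pd.DataFrame({'subject_id': ['1', '5'], 'employer': ['Facebook', np.nan]})
b = pd.DataFrame({'subject_id': ['4', '8'], 'employer': [np.nan, 'Facebook']})

# Remove rows with a null key on both sides, then do the default inner merge.
matched = a.dropna(subset=['employer']).merge(
    b.dropna(subset=['employer']),
    on='employer',
    suffixes=('_a', '_b'),
)

print(matched)
```

Only the `Facebook` row survives, since the null-to-null pairing is no longer possible.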


In [107]:
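# join_axes was removed in pandas 1.0; both frames already share the index 0..4,
# so an inner join on the index gives the same side-by-side result.
pd.concat([df_a, df_b], axis=1, join='inner')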

| | subject_id | first_name | last_name | employer | subject_id | first_name | last_name | employer |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | Facebook | 4 | Billy | Bonder | NaN |
| 1 | 2 | Amy | Ackerman | Google | 5 | Brian | Black | Kensho |
| 2 | 3 | Allen | Ali | Amazon | 6 | Bran | Balwner | Robinhood |
| 3 | 4 | Alice | Aoni | Apple | 7 | Bryce | Brice | LendUp |
| 4 | 5 | Ayoung | Atiches | NaN | 8 | Betty | Btisan | Facebook |


In [108]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
pd.concat([df_a, df_b], ignore_index=True)

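| | subject_id | first_name | last_name | employer |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | Facebook |
| 1 | 2 | Amy | Ackerman | Google |
| 2 | 3 | Allen | Ali | Amazon |
| 3 | 4 | Alice | Aoni | Apple |
| 4 | 5 | Ayoung | Atiches | NaN |
| 5 | 4 | Billy | Bonder | NaN |
| 6 | 5 | Brian | Black | Kensho |
| 7 | 6 | Bran | Balwner | Robinhood |
| 8 | 7 | Bryce | Brice | LendUp |
| 9 | 8 | Betty | Btisan | Facebook |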
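If the goal is de-duplicating, the stacked frame can be reduced to one row per employer. This is a minimal sketch on hypothetical two-row frames `a` and `b`; keeping the first occurrence is an assumption, and `keep='last'` or `keep=False` may suit other cases better.

```python
import pandas as pd

# Hypothetical frames that share one employer value.
a = pd.DataFrame({'first_name': ['Alex', 'Amy'], 'employer': ['Facebook', 'Google']})
b = pd.DataFrame({'first_name': ['Brian', 'Betty'], 'employer': ['Kensho', 'Facebook']})

# Stack the rows, then keep only the first row seen for each employer.
stacked = pd.concat([a, b], ignore_index=True)
deduped = stacked.drop_duplicates(subset='employer', keep='first').reset_index(drop=True)

print(deduped)
```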


In [58]:
df_b.employer.sort_values().values

array(['Facebook', 'Kensho', 'LendUp', 'Robinhood', nan], dtype=object)

In [61]:
a = np.array([1,2,3,4,5])
(a > 1).all() and (a < 5).all()

False

In [64]:
(a > 1) & (a < 5)

array([False,  True,  True,  True, False], dtype=bool)

In [63]:
df_a.employer.all() == df_b.employer.all()

False

In [73]:
df_a.employer.isin(['Facebook'])

0     True
1    False
2    False
3    False
4    False
Name: employer, dtype: bool

In [74]:
import numpy as np
import pandas as pd

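a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
# Cast to float so the comparisons below are numeric rather than lexicographic string comparisons.
df = pd.DataFrame(a, columns=['one', 'two', 'three']).astype(float)

# Keep 'one' where two <= one <= three; otherwise NaN.
df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three']),
                     df['one'], np.nan)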

In [52]:
df_a.employer.apply(lambda x: x)

0    Facebook
1      Google
2      Amazon
3       Apple
4         NaN
Name: employer, dtype: object

In [45]:
df_b.employer.values

array([nan, 'Kensho', 'Robinhood', 'LendUp', 'Facebook'], dtype=object)

In [22]:
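# Positions in df_a whose employer also appears in df_b (these are matches, not misses).
# isin treats NaN as equal to NaN, so the null row is included as well.
common_idx = np.where(df_a['employer'].isin(df_b['employer']))[0]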
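To get the index positions of values that do not appear in the other column, the `isin` mask can be inverted. This is a minimal sketch; `a_emp` and `b_emp` are hypothetical copies of the two employer columns, and nulls are excluded on both sides on the assumption that a missing value should count as neither a match nor a mismatch.

```python
import numpy as np
import pandas as pd

# Hypothetical copies of the two employer columns.
a_emp = pd.Series(['Facebook', 'Google', 'Amazon', 'Apple', np.nan])
b_emp = pd.Series([np.nan, 'Kensho', 'Robinhood', 'LendUp', 'Facebook'])

# True where a value from a_emp also appears among the non-null values of b_emp.
in_both = a_emp.isin(b_emp.dropna())

# Index labels of shared values, and of non-null values found only in a_emp.
common_idx = a_emp.index[in_both].tolist()
only_in_a_idx = a_emp.index[~in_both & a_emp.notna()].tolist()

print(common_idx)
print(only_in_a_idx)
```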

In [25]:
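# Caution: pop() removes the 'employer' column from df_a in place,
# so none of the remaining cells can match an employer value.
df_a.isin(df_a.pop('employer').values.tolist()).all(axis=1)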

0    False
1    False
2    False
3    False
4    False
dtype: bool