# Advanced merging and concatenating

## 1.1 Filtering joins
Filtering joins filter observations from one table based on whether or not they match an observation in another table.

### Semi join

A semi join filters the left table down to those observations that have a match in the right table. Like an inner join, it returns only the intersection of the two tables, but unlike an inner join it shows only the columns from the left table.
No duplicate rows from the left table are returned, even if there is a one-to-many relationship.


Steps in a semi join:
1. Merge the left and right tables on the key column using an inner join.
2. Check whether each key in the left table appears in the merged table using the .isin() method, which creates a boolean Series.
3. Use the boolean Series to subset the rows of the left table.
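
The three steps above can be sketched on a pair of toy tables. The tables `left` and `right` and their columns are invented for illustration and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
left = pd.DataFrame({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 1, 3], "value": [10, 11, 30]})

# Step 1: inner join on the key column (how="inner" is the default)
merged = left.merge(right, on="key")

# Step 2: boolean Series marking left rows whose key appears in the merged table
mask = left["key"].isin(merged["key"])

# Step 3: subset the left table; only its own columns remain
semi = left[mask]
print(semi)
```

Key 1 matches two rows in `right`, yet it appears only once in `semi`, because the filter is applied to the left table rather than to the merged result.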



### Anti join 

An anti join returns the observations in the left table that do not have a matching observation in the right table. It returns only the columns from the left table, not those from the right table.


Steps in an anti join:
1. Use a left join, which returns all rows from the left table, and set the indicator argument to True. The indicator argument adds a column called _merge to the output, which tells the source of each row.
2. Use the .loc accessor and the _merge column to select the key values of the rows that appear only in the left table (left_only).
3. Use the .isin() method to filter the left table down to the rows with those keys.
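
As a minimal sketch of these steps, the snippet below runs an anti join on two toy tables. The names `left`, `right`, and their columns are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
left = pd.DataFrame({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 1, 3], "value": [10, 11, 30]})

# Step 1: left join with indicator=True, which adds the _merge column
merged = left.merge(right, on="key", how="left", indicator=True)

# Step 2: keys of the rows found only in the left table
left_only_keys = merged.loc[merged["_merge"] == "left_only", "key"]

# Step 3: filter the original left table down to those keys
anti = left[left["key"].isin(left_only_keys)]
print(anti)
```

Filtering the original `left` table in the last step, instead of reusing `merged`, keeps the right-table columns and the `_merge` column out of the result.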




Anti join example

In our music streaming company dataset, each customer is assigned an employee representative to assist them. Filter the employee table by a table of top customers, returning only those employees who are not assigned to a top customer.

employees.columns  ['srid', 'lname', 'fname', 'title', 'hire_date', 'email']

top_cust.columns   ['cid', 'srid', 'fname', 'lname', 'phone', 'fax', 'email']


1. Merge employees and top_cust

empl_cust = employees.merge(top_cust, on='srid', how='left', indicator=True)

2. Select the srid column where _merge is left_only

srid_list = empl_cust.loc[empl_cust['_merge'] == 'left_only', 'srid']

3. Get employees not working with top customers

print(employees[employees['srid'].isin(srid_list)])

Semi join example

Some of the tracks that have generated the most revenue are from TV shows or are other non-musical audio. You have been given a table of invoices that includes the top revenue-generating items, along with a table of non-musical tracks from the streaming service. In this exercise, you'll use a semi join to find the top revenue-generating non-musical tracks.


non_mus_tcks.columns   ['tid', 'name', 'aid', 'mtid', 'gid', 'u_price']

top_invoices.columns   ['ilid', 'iid', 'tid', 'uprice', 'quantity']

genres.columns   ['gid', 'name']


1. Merge the non_mus_tcks and top_invoices tables on tid

tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid')

2. Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices

top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]

3. Group the top_tracks by gid and count the tid rows

cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})

4. Merge the genres table to cnt_by_gid on gid and print

print(cnt_by_gid.merge(genres, on = 'gid'))


## 1.2 Concatenate DataFrames vertically


### The pd.concat() function
pd.concat([df1, df2, df3])

1. Setting ignore_index=True resets the index so that it runs from 0 to n-1.

pd.concat([df1, df2, df3], ignore_index=True)

2. Label the rows from each original table with the keys argument. ignore_index must be False, otherwise the keys are discarded.

pd.concat([df1, df2, df3], ignore_index=False, keys=["name1", "name2", "name3"])

3. Concatenate tables with different column names

By default, concat includes all of the columns from the different tables it is combining. If the sort argument is True, the column names in the result are sorted alphabetically.

pd.concat([df1, df2], sort=True)

4. To concatenate only the matching columns, set the join argument to "inner". By default, join is "outer", which is why concat includes all the columns.

pd.concat([df1, df2], join = "inner")
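
The four options above can be compared side by side on two small tables. The following is a sketch only; `df1`, `df2`, their columns, and the key labels are made up for the example.

```python
import pandas as pd

# Hypothetical tables that share column "a" but differ in "b" and "c"
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# 1. Fresh 0..n-1 index
stacked = pd.concat([df1, df2], ignore_index=True)

# 2. Label each source table; the result has a two-level index
labeled = pd.concat([df1, df2], keys=["first", "second"])

# 3. Outer join (default) keeps every column; sort=True orders them by name
outer = pd.concat([df1, df2], sort=True)

# 4. Inner join keeps only the columns present in both tables
inner = pd.concat([df1, df2], join="inner")

print(stacked.index.tolist())
print(labeled.loc["second"])
print(outer.columns.tolist())
print(inner.columns.tolist())
```

With the outer join, cells that have no source value (column `c` for rows from `df1`, for example) are filled with NaN.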


### The append( ) method

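.append() is a simplified version of concat that is called on a DataFrame. It supports the ignore_index and sort arguments but not keys or join; the join is always outer. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is used instead.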


df.append([df1, df2], ignore_index=True, sort=True)
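
For pandas versions in which DataFrame.append is not available, the call above can be written with pd.concat by placing the calling DataFrame first in the list. The tables below are hypothetical and exist only to make the sketch runnable.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only
df = pd.DataFrame({"b": [1], "a": [2]})
df1 = pd.DataFrame({"b": [3], "a": [4]})
df2 = pd.DataFrame({"b": [5], "a": [6]})

# Intended to mirror df.append([df1, df2], ignore_index=True, sort=True)
combined = pd.concat([df, df1, df2], ignore_index=True, sort=True)
print(combined)
```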


## 1.3 Verifying integrity

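### Validating merges
The validate argument of .merge() checks that the merge keys have the stated relationship between the two tables and raises a MergeError if they do not. The default, None, performs no check.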
.merge(validate = None)

.merge(validate = "one_to_one")

.merge(validate = "one_to_many")

.merge(validate = "many_to_one")

.merge(validate = "many_to_many")


E.g.

df1.merge(df2, on = "common_column", validate = "one_to_one")
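
To see the check in action, the sketch below merges two toy tables whose relationship is one-to-many. The tables and column names are assumptions made for the example; `pd.errors.MergeError` is the exception pandas raises when validation fails.

```python
import pandas as pd

# Hypothetical tables: each key is unique in left but repeated in right
left = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"key": [1, 1, 2], "y": [10, 11, 20]})

# Passes: the actual relationship is one-to-many
ok = left.merge(right, on="key", validate="one_to_many")

# Fails: key 1 is duplicated in right, so one-to-one does not hold
try:
    left.merge(right, on="key", validate="one_to_one")
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print("validation failed:", err)
```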


### Verifying concatenations
By default, pd.concat() uses verify_integrity=False.
If set to True, it checks whether the index of the result contains duplicate values and raises an error if it does.

E.g.

pd.concat([df1, df2], verify_integrity=True)
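
A minimal sketch of the check, assuming two hypothetical tables whose indexes overlap at label 1. pandas raises a ValueError in this case; one possible way to resolve it, shown at the end, is to discard the original index with ignore_index=True.

```python
import pandas as pd

# Hypothetical tables whose indexes both contain the label 1
df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"a": [3, 4]}, index=[1, 2])

try:
    pd.concat([df1, df2], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("integrity check failed:", err)

# One option when the original index carries no meaning: rebuild it
fixed = pd.concat([df1, df2], ignore_index=True)
print(fixed)
```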
