# Merging Datasets and Inner Join

We continue with advanced Pandas operations, namely how to merge different dataframes. What we describe here is also explained in the pdf file:

`4. Pandas_Merge.pdf`

In many cases the data that we need to answer our question reside in different files or DataFrames. So we have to *merge* them. How do we do it?

We have to specify which two DataFrames we wish to join and also *how* they will be joined, i.e. what the common identifier/key is.

For example, one dataframe may hold the transaction details and the customer id, and another dataframe can hold the details of each customer (city, telephone) and the customer id. By bringing these two together we have a complete overview.


- *Inner Join*: Taking the inner join on a column returns only those rows from the two DataFrames that have a common value in that column.<br><br>
`A.merge(B, how="inner", on="ID")`
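As a minimal sketch of the inner join described above, the following uses two small made-up DataFrames (`A`, `B`, and their values are hypothetical and are not the course data). Only the `ID` values present in both frames should remain after the merge.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not the course files)
A = pd.DataFrame({"ID": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
B = pd.DataFrame({"ID": [2, 3, 4], "city": ["Bern", "Lausanne", "Zurich"]})

# Inner join: keep only the rows whose ID appears in both A and B
inner = A.merge(B, how="inner", on="ID")
print(inner)
```

IDs 1 and 4 each appear in only one of the two frames, so they are dropped from the result.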

In [17]:
import pandas as pd

transactions = pd.read_csv("https://raw.githubusercontent.com/ahmadajal/DM_ML_course_public/master/2%263.%20Data%26EDA/data/data_merge_examples/transactions.csv")
transactions.head()

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost
0,149332,15.11.2005,1,199.95,107.0
1,172951,29.08.2008,1,199.95,108.0
2,120621,19.10.2007,1,99.95,49.0
3,149236,14.11.2005,1,39.95,18.95
4,149236,12.06.2007,1,79.95,35.0


In [18]:
demographics = pd.read_csv("https://raw.githubusercontent.com/ahmadajal/DM_ML_course_public/master/2%263.%20Data%26EDA/data/data_merge_examples/demographics.csv")
demographics.head()

Unnamed: 0,Customer,Gender,Birthdate,ZIP,JoinDate
0,80365,f,26.08.1991,US-06332,15.09.2009
1,42886,f,04.05.1987,US-08055,12.06.2011
2,84374,m,10.07.1977,US-06400,10.08.1988
3,42291,m,12.07.1963,US-04533,23.07.1998
4,100001,m,08.05.1974,US-02332,21.02.1992


---
# Exercise
---

Merge transactions and demographics by `Customer` using an [inner join](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html). Then select the customers whose birthdate falls in 1980 (hint: use `dt.year`).

Note: When merging, columns that have the same name in both DataFrames (other than the join key) get a suffix: `_x` for the left DataFrame and `_y` for the right one.
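The suffix behaviour can be checked on a small example. This is only a sketch: the frames `left` and `right`, the shared column name `Date`, and the custom suffixes `_trans` and `_demo` are hypothetical. The `suffixes` parameter of `merge` is used to replace the default `_x`/`_y`.

```python
import pandas as pd

# Hypothetical frames that share a non-key column name ("Date")
left = pd.DataFrame({"Customer": [1, 2], "Date": ["01.01.2020", "02.01.2020"]})
right = pd.DataFrame({"Customer": [1, 2], "Date": ["05.05.1980", "06.06.1990"]})

# Default suffixes: the overlapping column becomes Date_x (left) and Date_y (right)
default = left.merge(right, how="inner", on="Customer")
print(list(default.columns))

# Custom suffixes make the origin of each column easier to read
named = left.merge(right, how="inner", on="Customer", suffixes=("_trans", "_demo"))
print(list(named.columns))
```

The join key `Customer` appears once and gets no suffix in either case.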

In [None]:
# YOUR SOLUTION

# Convert "TransDate" to datetime; the dates are stored as day.month.year
transactions["TransDate"] = pd.to_datetime(transactions["TransDate"], format="%d.%m.%Y")
# Convert "Birthdate" to datetime using the same day.month.year format
demographics["Birthdate"] = pd.to_datetime(demographics["Birthdate"], format="%d.%m.%Y")


# the join
df_merged = transactions.merge(demographics, how="inner", on="Customer")
df_merged.head()

In [None]:
# now do the selection
merge1980_2=df_merged.loc[df_merged["Birthdate"]...etc]
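To try out the `dt.year` hint in isolation, here is a small sketch on a made-up frame (`people` and its values are hypothetical). It assumes the dates are written as day.month.year, as in the files above, and shows one possible way to filter on the year.

```python
import pandas as pd

# Hypothetical toy data with day.month.year date strings
people = pd.DataFrame({
    "Customer": [1, 2, 3],
    "Birthdate": ["26.08.1991", "04.05.1980", "10.07.1980"],
})

# Parse the strings as dates; the explicit format avoids day/month confusion
people["Birthdate"] = pd.to_datetime(people["Birthdate"], format="%d.%m.%Y")

# .dt.year extracts the year, which gives a boolean mask for .loc
born_1980 = people.loc[people["Birthdate"].dt.year == 1980]
print(born_1980)
```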

# Full Outer Join

The full outer join merges two DataFrames on a column `c` and returns the *union* of the rows: every value of `c` found in either DataFrame appears in the result. Where a row has no match in the other DataFrame, the columns coming from that DataFrame are filled with NaN (Not a Number) or, for datetime columns, NaT (Not a Time).
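A small sketch of this behaviour, using made-up frames (`trans`, `demo`, and their values are hypothetical). The optional `indicator=True` argument of `merge` adds a `_merge` column that records whether each row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

# Hypothetical toy data: customer 1 has no demographics, customer 3 has no transactions
trans = pd.DataFrame({"Customer": [1, 1, 2], "PurchAmount": [9.95, 19.95, 5.0]})
demo = pd.DataFrame({"Customer": [2, 3], "Gender": ["f", "m"]})

# Outer join: keep every customer from either frame, fill the gaps with NaN
outer = trans.merge(demo, how="outer", on="Customer", indicator=True)
print(outer)
```

Customer 1 should end up with a missing `Gender`, and customer 3 with a missing `PurchAmount`.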


In [6]:
# Join 2 dataframes with outer join
pd.options.display.max_rows = 10
transactions.merge(demographics, how="outer", on="Customer")

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,Gender,Birthdate,ZIP,JoinDate
0,149332,15.11.2005,1.0,199.95,107.00,m,07.07.1998,US-08873,05.11.2005
1,149332,13.12.2005,1.0,49.95,24.87,m,07.07.1998,US-08873,05.11.2005
2,149332,05.10.2006,1.0,24.95,12.50,m,07.07.1998,US-08873,05.11.2005
3,172951,29.08.2008,1.0,199.95,108.00,m,16.11.1963,US-11378,04.04.1980
4,172951,29.08.2008,1.0,249.95,162.50,m,16.11.1963,US-11378,04.04.1980
...,...,...,...,...,...,...,...,...,...
224190,200995,,,,,f,09.11.1992,US-62035,11.01.1978
224191,200996,,,,,m,26.08.1976,US-17844,05.04.2005
224192,200997,,,,,f,21.06.1997,US-30324,09.10.1995
224193,200998,,,,,m,02.05.1967,US-10017,11.06.1988


*Sidenote:* Selecting missing and non-missing values

In [None]:
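# keep the outer-join result so that it can be filtered
myData = transactions.merge(demographics, how="outer", on="Customer")
# select rows where PurchAmount is missing
myData.loc[pd.isnull(myData["PurchAmount"]), ]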

In [None]:
# select non-missing values
myData.loc[ ~ pd.isnull(myData["PurchAmount"]), ]
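The same selections can also be written with the Series methods `isna()` and `notna()`. The sketch below uses a made-up frame (`merged` and its values are hypothetical) standing in for a merge result with gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a merged frame with one missing purchase amount
merged = pd.DataFrame({"Customer": [1, 2, 3], "PurchAmount": [9.95, np.nan, 5.0]})

# Rows where PurchAmount is missing
missing = merged.loc[merged["PurchAmount"].isna()]

# Rows where PurchAmount is present
present = merged.loc[merged["PurchAmount"].notna()]

print(missing)
print(present)
```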

# Left and Right Outer Joins

In [None]:
# left outer join
# keep all the observations from transactions, but only the matching ones from demographics
transactions.merge(demographics, how="left", on="Customer")

In [None]:
# do the same for right outer joins
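To see how the two directions differ, here is a sketch on made-up frames (`trans`, `demo`, and their values are hypothetical). A left join keeps every row of the calling DataFrame, while a right join keeps every row of the DataFrame passed as the argument.

```python
import pandas as pd

# Hypothetical toy data: customer 1 only has transactions, customer 3 only has demographics
trans = pd.DataFrame({"Customer": [1, 2], "PurchAmount": [9.95, 5.0]})
demo = pd.DataFrame({"Customer": [2, 3], "Gender": ["f", "m"]})

# Left join: all rows of trans, Gender is NaN where demo has no match
left_join = trans.merge(demo, how="left", on="Customer")

# Right join: all rows of demo, PurchAmount is NaN where trans has no match
right_join = trans.merge(demo, how="right", on="Customer")

print(left_join)
print(right_join)
```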