<a href="https://colab.research.google.com/github/mateosuster/pythonungs/blob/master/codigos/pandas/10%20-%20Merging%20DataFrames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Merging DataFrames

In [1]:
import pandas as pd

## Our Dataset
- The datasets for this section are spread across multiple files. Each file name starts with the prefix `restaurant_`.
- `restaurant_customers.csv` stores our restaurant's customers.
- `restaurant_foods.csv` stores the items on our restaurant's menu.
- `restaurant_week_1_sales.csv` and `restaurant_week_2_sales.csv` store our orders, one file per week.

In [3]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")

## The pd.concat Function I
- The `concat` function appends one **DataFrame** to the end of another. [Link to the documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
- The original index labels are kept by default, so labels can repeat in the result. Set `ignore_index` to True to generate a new index.
- The `keys` parameter creates a **MultiIndex** using the specified keys/labels.
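
The three behaviours above are easier to see on a toy example. The sketch below uses two hypothetical two-row frames (`first` and `second` are illustrative names, not part of the restaurant data) and only inspects the resulting index.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the weekly sales files.
first = pd.DataFrame({"Customer ID": [537, 97], "Food ID": [9, 4]})
second = pd.DataFrame({"Customer ID": [688, 813], "Food ID": [10, 7]})

# Default: each frame keeps its own labels, so 0 and 1 appear twice.
kept = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)

# keys adds an outer index level recording which frame each row came from.
labelled = pd.concat([first, second], keys=["Week 1", "Week 2"])

print(kept.index.tolist())
print(renumbered.index.tolist())
print(labelled.loc["Week 2"])  # only the rows that came from `second`
```

Selecting with `labelled.loc["Week 2"]` is one reason to prefer `keys` over `ignore_index=True` when the origin of each row still matters.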

In [4]:
week1.head()

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9


In [5]:
week2.head()

Unnamed: 0,Customer ID,Food ID
0,688,10
1,813,7
2,495,10
3,189,5
4,267,3


In [6]:
len(week1)

250

In [7]:
len(week2)

250

In [10]:
pd.concat([week1, week2], ignore_index=False)  # ignore_index=False is the default

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9
...,...,...
245,783,10
246,556,10
247,547,9
248,252,9


In [11]:
pd.concat([week1, week2], ignore_index=True)


Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9
...,...,...
495,783,10
496,556,10
497,547,9
498,252,9


In [12]:
pd.concat([week1, week2], keys=["Week 1", "Week 2"]).sort_index()

Unnamed: 0,Unnamed: 1,Customer ID,Food ID
Week 1,0,537,9
Week 1,1,97,4
Week 1,2,658,1
Week 1,3,202,2
Week 1,4,155,9
...,...,...,...
Week 2,245,783,10
Week 2,246,556,10
Week 2,247,547,9
Week 2,248,252,9


## The pd.concat Function II
- By default, pandas concatenates the **DataFrames** along the row/index axis.
- Pandas includes every column that exists in any of the **DataFrames**. Where a **DataFrame** has no value for a column, pandas fills in `NaN`.
- We can pass the `axis` parameter an argument of `"columns"` to concatenate along the column axis instead.
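
As a complementary sketch, consider two hypothetical frames that share only one column (`B`). The names `left` and `right` are illustrative assumptions; the point is to see which cells end up as `NaN` and what happens to the shared column on each axis.

```python
import pandas as pd

# Hypothetical frames: both have column B, only one has A, only one has C.
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Row axis: the result has columns A, B and C.
# A is NaN for the rows from `right`, C is NaN for the rows from `left`.
stacked = pd.concat([left, right], ignore_index=True)

# Column axis: rows are aligned on the index and the frames sit side by side,
# so the shared column name B appears twice.
side_by_side = pd.concat([left, right], axis="columns")

print(stacked)
print(side_by_side)
```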

In [13]:
df1 = pd.DataFrame([1, 2, 3], columns=["A"])
df1

Unnamed: 0,A
0,1
1,2
2,3


In [14]:
df2 = pd.DataFrame([4, 5, 6], columns=["B"])
df2

Unnamed: 0,B
0,4
1,5
2,6


In [15]:
pd.concat([df1, df2], axis="index")

Unnamed: 0,A,B
0,1.0,
1,2.0,
2,3.0,
0,,4.0
1,,5.0
2,,6.0


In [16]:
pd.concat([df1, df2], axis="columns")

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


## Left Joins
- The `merge` method joins two **DataFrames** based on shared values in a column or an index.
- A left join merges one **DataFrame** into another based on the values of the first: every row of the left **DataFrame** is kept.
- The "left" **DataFrame** is the one we invoke the `merge` method on.
- If a value from the left **DataFrame** is not found in the right **DataFrame**, the columns that come from the right **DataFrame** will hold `NaN` in that row.
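
A minimal sketch of the `NaN` case, using hypothetical data: the `orders` frame contains a `Food ID` of 99 that does not exist in `menu`, and `menu` contains an item that nobody ordered.

```python
import pandas as pd

# Hypothetical data: Food ID 99 is not on the menu, and nobody ordered Food ID 3.
orders = pd.DataFrame({"Customer ID": [1, 2, 3], "Food ID": [1, 2, 99]})
menu = pd.DataFrame({"Food ID": [1, 2, 3],
                     "Food Item": ["Sushi", "Burrito", "Taco"]})

# Every row of `orders` (the left frame) survives the merge.
result = orders.merge(menu, how="left", on="Food ID")

print(result)
# The order with Food ID 99 has NaN in "Food Item";
# "Taco" is absent because no row on the left refers to it.
```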




![](https://datacomy.com/data_analysis/pandas/merge/types-of-joins.png)

In [17]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")

In [18]:
week1.head()

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9


In [19]:
foods.head(5)

Unnamed: 0,Food ID,Food Item,Price
0,1,Sushi,3.99
1,2,Burrito,9.99
2,3,Taco,2.99
3,4,Quesadilla,4.25
4,5,Pizza,2.49


In [20]:
week1.merge(foods, how="left", on="Food ID")

Unnamed: 0,Customer ID,Food ID,Food Item,Price
0,537,9,Donut,0.99
1,97,4,Quesadilla,4.25
2,658,1,Sushi,3.99
3,202,2,Burrito,9.99
4,155,9,Donut,0.99
...,...,...,...,...
245,413,9,Donut,0.99
246,926,6,Pasta,13.99
247,134,3,Taco,2.99
248,396,6,Pasta,13.99


## The left_on and right_on Parameters
- The `left_on` and `right_on` parameters name the column to match on in each **DataFrame**, which is needed when the matching columns have different names.
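
A small sketch with hypothetical frames shows the effect: the key is called `Customer ID` on one side and `ID` on the other, and both columns appear in the merged result.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
orders = pd.DataFrame({"Customer ID": [2, 1], "Food ID": [10, 7]})
people = pd.DataFrame({"ID": [1, 2], "First Name": ["Joseph", "Jennifer"]})

merged = orders.merge(people, how="left", left_on="Customer ID", right_on="ID")

# Both key columns are kept, so one of them is redundant and can be dropped.
tidy = merged.drop(columns="ID")

print(merged.columns.tolist())
print(tidy)
```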

In [21]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")

In [22]:
week2.head()

Unnamed: 0,Customer ID,Food ID
0,688,10
1,813,7
2,495,10
3,189,5
4,267,3


In [23]:
customers.head()

Unnamed: 0,ID,First Name,Last Name,Gender,Company,Occupation
0,1,Joseph,Perkins,Male,Dynazzy,Community Outreach Specialist
1,2,Jennifer,Alvarez,Female,DabZ,Senior Quality Engineer
2,3,Roger,Black,Male,Tagfeed,Account Executive
3,4,Steven,Evans,Male,Fatz,Registered Nurse
4,5,Judy,Morrison,Female,Demivee,Legal Assistant


In [25]:
week2.merge(customers, how="left", left_on="Customer ID", right_on="ID")

Unnamed: 0,Customer ID,Food ID,ID,First Name,Last Name,Gender,Company,Occupation
0,688,10,688,Carl,Williamson,Male,Thoughtmix,Graphic Designer
1,813,7,813,Johnny,Walker,Male,Kayveo,Developer II
2,495,10,495,Deborah,Little,Female,Babbleblab,VP Accounting
3,189,5,189,Roger,Gordon,Male,Skilith,Operator
4,267,3,267,Matthew,Wood,Male,Agimba,Product Engineer
...,...,...,...,...,...,...,...,...
245,783,10,783,Phyllis,Meyer,Female,Voolia,Information Systems Manager
246,556,10,556,Samuel,Bailey,Male,Oyoloo,Nurse
247,547,9,547,Tina,Watkins,Female,Thoughtstorm,Accountant II
248,252,9,252,Douglas,Powell,Male,Jetwire,Geologist IV


In [26]:
week2.merge(customers, how="left", left_on="Customer ID", right_on="ID").drop("ID", axis="columns")

Unnamed: 0,Customer ID,Food ID,First Name,Last Name,Gender,Company,Occupation
0,688,10,Carl,Williamson,Male,Thoughtmix,Graphic Designer
1,813,7,Johnny,Walker,Male,Kayveo,Developer II
2,495,10,Deborah,Little,Female,Babbleblab,VP Accounting
3,189,5,Roger,Gordon,Male,Skilith,Operator
4,267,3,Matthew,Wood,Male,Agimba,Product Engineer
...,...,...,...,...,...,...,...
245,783,10,Phyllis,Meyer,Female,Voolia,Information Systems Manager
246,556,10,Samuel,Bailey,Male,Oyoloo,Nurse
247,547,9,Tina,Watkins,Female,Thoughtstorm,Accountant II
248,252,9,Douglas,Powell,Male,Jetwire,Geologist IV


## Inner Joins I
- Inner joins merge two tables based on *shared*/*common* values in the columns.
- If only one **DataFrame** has a value, pandas excludes it from the final result set.
- If the same ID occurs multiple times, pandas stores every possible combination of the matching rows.
- Because of this design, the matched rows are the same no matter which **DataFrame** the `merge` method is invoked on; only the order of rows and columns may differ.
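
The "every possible combination" rule can be checked on a tiny hypothetical example in which customer 155 orders twice in each week, so the inner join should produce 2 × 2 = 4 rows for that customer and nothing for the others.

```python
import pandas as pd

# Hypothetical data: customer 155 appears twice in each frame;
# customers 97 and 688 appear in only one of them.
w1 = pd.DataFrame({"Customer ID": [155, 155, 97], "Food ID": [9, 1, 4]})
w2 = pd.DataFrame({"Customer ID": [155, 155, 688], "Food ID": [3, 5, 10]})

inner = w1.merge(w2, how="inner", on="Customer ID",
                 suffixes=(" - Week 1", " - Week 2"))

# Invoking merge on the other frame matches the same rows.
swapped = w2.merge(w1, how="inner", on="Customer ID")

print(inner)
print(len(inner), len(swapped))
```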

In [None]:
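foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")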

In [None]:
week1[week1["Customer ID"] == 155]

Unnamed: 0,Customer ID,Food ID
4,155,9
17,155,1


In [None]:
week2[week2["Customer ID"] == 155]

Unnamed: 0,Customer ID,Food ID
208,155,3


In [None]:
week1.merge(week2, how="inner", on="Customer ID", suffixes=[" - Week 1", " - Week 2"])

Unnamed: 0,Customer ID,Food ID - Week 1,Food ID - Week 2
0,537,9,5
1,155,9,3
2,155,1,3
3,503,5,8
4,503,5,9
...,...,...,...
57,945,5,4
58,343,3,5
59,343,3,2
60,343,3,7


## Inner Joins II
- We can pass a list of column names to the `on` parameter of the `merge` method. Pandas will then require a match in all of the listed columns across the **DataFrames**.
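
A toy sketch of the multi-column match, with hypothetical data: customer 21 appears in both frames but with a different `Food ID`, so that customer should be excluded, while a pair that is repeated on one side yields one result row per repetition.

```python
import pandas as pd

# Hypothetical data: (578, 5) occurs once on the left and twice on the right;
# customer 21 ordered different foods in the two frames.
w1 = pd.DataFrame({"Customer ID": [578, 21, 304], "Food ID": [5, 4, 3]})
w2 = pd.DataFrame({"Customer ID": [578, 578, 21, 304], "Food ID": [5, 5, 8, 3]})

# A row matches only if BOTH Customer ID and Food ID are equal.
both = w1.merge(w2, how="inner", on=["Customer ID", "Food ID"])

print(both)
```

Because both key columns are shared, no suffixes are needed and the result keeps just the two key columns.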

In [27]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")

In [28]:
week1.head()

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9


In [29]:
week2.head()

Unnamed: 0,Customer ID,Food ID
0,688,10
1,813,7
2,495,10
3,189,5
4,267,3


In [30]:
week1.merge(week2, how="inner", on=["Customer ID", "Food ID"])

Unnamed: 0,Customer ID,Food ID
0,304,3
1,540,3
2,937,10
3,233,3
4,21,4
5,21,4
6,922,1
7,578,5
8,578,5


In [31]:
condition_one = week1["Customer ID"] == 578
condition_two = week1["Food ID"] == 5
week1[condition_one & condition_two]

Unnamed: 0,Customer ID,Food ID
224,578,5


In [32]:
condition_one = week2["Customer ID"] == 578
condition_two = week2["Food ID"] == 5
week2[condition_one & condition_two]

Unnamed: 0,Customer ID,Food ID
29,578,5
189,578,5


## Full/Outer Join
- A **full/outer join** combines the values found in either of the **DataFrames** or in both **DataFrames**.
- Pandas keeps a row whether or not its value exists in the other **DataFrame**.
- If a value does not exist in one **DataFrame**, the columns that come from that **DataFrame** will hold `NaN`.
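
The following sketch, on hypothetical two-row frames, shows an outer join together with `indicator=True`, which adds a `_merge` column saying where each row came from. Row order in the result can vary between pandas versions, so the example avoids relying on it.

```python
import pandas as pd

# Hypothetical data: customer 537 ordered in both weeks,
# 97 only in the first and 855 only in the second.
w1 = pd.DataFrame({"Customer ID": [537, 97], "Food ID": [9, 4]})
w2 = pd.DataFrame({"Customer ID": [537, 855], "Food ID": [5, 4]})

outer = w1.merge(w2, how="outer", on="Customer ID",
                 suffixes=(" - Week 1", " - Week 2"), indicator=True)

# Count how many rows fall into each category of the _merge column.
counts = outer["_merge"].value_counts()

# Keep only the customers who ordered in exactly one of the two weeks.
one_week_only = outer[outer["_merge"] != "both"]

print(outer)
print(counts)
```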

In [33]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")

In [34]:
week1.head()

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9


In [35]:
week2.head()

Unnamed: 0,Customer ID,Food ID
0,688,10
1,813,7
2,495,10
3,189,5
4,267,3


In [36]:
week1.merge(week2, how="outer", on="Customer ID", suffixes=[" - Week 1", " - Week 2"])


Unnamed: 0,Customer ID,Food ID - Week 1,Food ID - Week 2
0,537,9.0,5.0
1,97,4.0,
2,658,1.0,
3,202,2.0,
4,155,9.0,3.0
...,...,...,...
449,855,,4.0
450,559,,10.0
451,276,,4.0
452,556,,10.0


In [37]:
week1.merge(week2, how="outer", on="Customer ID", suffixes=[" - Week 1", " - Week 2"], indicator=True)

Unnamed: 0,Customer ID,Food ID - Week 1,Food ID - Week 2,_merge
0,537,9.0,5.0,both
1,97,4.0,,left_only
2,658,1.0,,left_only
3,202,2.0,,left_only
4,155,9.0,3.0,both
...,...,...,...,...
449,855,,4.0,right_only
450,559,,10.0,right_only
451,276,,4.0,right_only
452,556,,10.0,right_only


In [39]:
merged = week1.merge(week2, how="outer", on="Customer ID", suffixes=[" - Week 1", " - Week 2"], indicator=True)
merged["_merge"].value_counts()

_merge
right_only    197
left_only     195
both           62
Name: count, dtype: int64

In [40]:
merged[merged["_merge"].isin(["left_only", "right_only"])]

Unnamed: 0,Customer ID,Food ID - Week 1,Food ID - Week 2,_merge
1,97,4.0,,left_only
2,658,1.0,,left_only
3,202,2.0,,left_only
6,213,8.0,,left_only
7,600,1.0,,left_only
...,...,...,...,...
449,855,,4.0,right_only
450,559,,10.0,right_only
451,276,,4.0,right_only
452,556,,10.0,right_only


## Merging on Indexes with the left_index and right_index Parameters
- Use the `on` parameter if the column(s) to match on have the same names in both **DataFrames**.
- Use the `left_on` and `right_on` parameters if the column(s) to match on have different names in the two **DataFrames**.
- Use the `left_index` or `right_index` parameters (set to True) if the values to match on are stored in the index of a **DataFrame**.
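
A minimal sketch of mixing a column on the left with an index on the right, assuming hypothetical lookup tables (`people` and `menu`) whose identifiers are stored in the index rather than in a column.

```python
import pandas as pd

# Hypothetical orders table with the identifiers stored in ordinary columns.
orders = pd.DataFrame({"Customer ID": [2, 1], "Food ID": [1, 2]})

# Hypothetical lookup tables with the identifiers stored in the index.
people = pd.DataFrame({"First Name": ["Joseph", "Jennifer"]},
                      index=pd.Index([1, 2], name="ID"))
menu = pd.DataFrame({"Food Item": ["Sushi", "Burrito"]},
                    index=pd.Index([1, 2], name="Food ID"))

# Match a column of the left frame against the index of the right frame.
full = (
    orders
    .merge(people, how="left", left_on="Customer ID", right_index=True)
    .merge(menu, how="left", left_on="Food ID", right_index=True)
)

print(full)
```

Since the right-hand key lives in the index, no duplicate key column is added to the result, so there is nothing to drop afterwards.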

In [41]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv",
                    index_col="Food ID")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv",
                         index_col="ID")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")


In [42]:
week1.head()

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9


In [43]:
customers.head()

ID,First Name,Last Name,Gender,Company,Occupation
1,Joseph,Perkins,Male,Dynazzy,Community Outreach Specialist
2,Jennifer,Alvarez,Female,DabZ,Senior Quality Engineer
3,Roger,Black,Male,Tagfeed,Account Executive
4,Steven,Evans,Male,Fatz,Registered Nurse
5,Judy,Morrison,Female,Demivee,Legal Assistant


In [44]:
foods.head()

Food ID,Food Item,Price
1,Sushi,3.99
2,Burrito,9.99
3,Taco,2.99
4,Quesadilla,4.25
5,Pizza,2.49


In [45]:
week1.merge(
    customers, how="left", left_on="Customer ID", right_index=True
).merge(foods, how="left", left_on="Food ID", right_index=True)

Unnamed: 0,Customer ID,Food ID,First Name,Last Name,Gender,Company,Occupation,Food Item,Price
0,537,9,Cheryl,Carroll,Female,Zoombeat,Registered Nurse,Donut,0.99
1,97,4,Amanda,Watkins,Female,Ozu,Account Coordinator,Quesadilla,4.25
2,658,1,Patrick,Webb,Male,Browsebug,Community Outreach Specialist,Sushi,3.99
3,202,2,Louis,Campbell,Male,Rhynoodle,Account Representative III,Burrito,9.99
4,155,9,Carolyn,Diaz,Female,Gigazoom,Database Administrator III,Donut,0.99
...,...,...,...,...,...,...,...,...,...
245,413,9,Diane,Bailey,Female,Wikibox,Technical Writer,Donut,0.99
246,926,6,Anne,Wagner,Female,Skyba,Legal Assistant,Pasta,13.99
247,134,3,Diana,Hall,Female,Quinu,Financial Advisor,Taco,2.99
248,396,6,Juan,Romero,Male,Zoonder,Analyst Programmer,Pasta,13.99


## The join Method
- The `join` method is a shortcut for merging two **DataFrames** on their index labels. It performs a left join by default.
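
To make the shortcut concrete, the sketch below compares `join` with the equivalent index-based `merge` on hypothetical frames. The `times` frame is deliberately one row shorter than `orders`, so the last order has no matching index label.

```python
import pandas as pd

# Hypothetical data: three orders but only two recorded times.
orders = pd.DataFrame({"Customer ID": [537, 97, 658], "Food ID": [9, 4, 1]})
times = pd.DataFrame({"Time of Day": ["14:54:59", "20:55:17"]})

# Explicit form: left join on the index labels of both frames.
via_merge = orders.merge(times, how="left", left_index=True, right_index=True)

# Shortcut: join matches on index labels and defaults to a left join.
via_join = orders.join(times)

print(via_join)
print(via_join.equals(via_merge))
```

The third order keeps its row but has a missing `Time of Day`, which is the same behaviour as any other left join.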

In [48]:
foods = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_foods.csv")
customers = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_customers.csv")
week1 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_sales.csv")
week2 = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_2_sales.csv")
times = pd.read_csv("https://raw.githubusercontent.com/mateosuster/pythonungs/master/data/restaurant_week_1_times.csv")

In [46]:
week1.head()

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9


In [49]:
times.head()

Unnamed: 0,Time of Day
0,14:54:59
1,20:55:17
2,01:16:22
3,16:17:26
4,19:26:11


In [50]:
week1.merge(times, how="left", left_index=True, right_index=True)

Unnamed: 0,Customer ID,Food ID,Time of Day
0,537,9,14:54:59
1,97,4,20:55:17
2,658,1,01:16:22
3,202,2,16:17:26
4,155,9,19:26:11
...,...,...,...
245,413,9,04:44:14
246,926,6,07:46:21
247,134,3,20:45:08
248,396,6,01:09:06


In [51]:
week1.join(times)

Unnamed: 0,Customer ID,Food ID,Time of Day
0,537,9,14:54:59
1,97,4,20:55:17
2,658,1,01:16:22
3,202,2,16:17:26
4,155,9,19:26:11
...,...,...,...
245,413,9,04:44:14
246,926,6,07:46:21
247,134,3,20:45:08
248,396,6,01:09:06
