<h1 id="Pandas-Combining-DataFrames">Pandas Combining DataFrames</h1>

- pd.<b>concat()</b> -> Used to append one DataFrame to the end of another (a top-level pandas function, not a DataFrame method)

- pd.DataFrame.<b>merge()</b> -> Combine 2 different dataframes with a shared column into 1 dataframe

- pd.DataFrame.<b>join()</b> -> Combine 2 different dataframes into 1 using the index by default (i.e. no shared column needed)
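
The rest of this notebook concentrates on merging, so here is a minimal sketch of the concat case from the list above. The frames `top` and `bottom` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical example frames with the same columns
top = pd.DataFrame({"Types": [1, 2], "Info": ["a", "b"]})
bottom = pd.DataFrame({"Types": [3, 4], "Info": ["c", "d"]})

# Stack the rows of `bottom` under `top`.
# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0,1,0,1.
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
```

Note that `pd.concat()` takes a list of DataFrames, so more than two can be stacked in one call.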

In [1]:
import pandas as pd

The `how` parameter in both `.join()` and `.merge()` tells pandas what sort of join to perform. We'll cover this in detail when we discuss SQL.

![image showcasing how the how parameter in a join/merge would combine the two datasets, using venn-style diagrams](https://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)
[[Image Source]](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)

<p><span style="font-size: 14pt; color: #169179;">To <strong>merge,</strong> use</span></p>
<pre><span style="font-size: 14pt; color: #169179;"> pd.merge(a,b,on="some column",how ="inner/outer/left/right")</span></pre>
<p><span style="font-size: 14pt; color: #169179;">Where <strong>a,b</strong> are <strong>DataFrames.</strong>&nbsp; For left/right, <strong>a</strong> is <strong>left,</strong> <strong>b</strong> is <strong>right.</strong></span></p>
<p><span style="font-size: 14pt; color: #169179;">This will <strong>return</strong> the <strong>merged</strong> dataframe.&nbsp; It does <strong>NOT</strong> change a or b</span></p>
<p><span style="font-size: 14pt; color: #169179;"><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html</a></span></p>
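
To make the "does NOT change a or b" point concrete, here is a small sketch. The frames and column names (`key`, `left_val`, `right_val`) are invented for this illustration.

```python
import pandas as pd

a = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
b = pd.DataFrame({"key": [1, 3], "right_val": ["m", "n"]})

# Keep a copy so we can check that merge leaves `a` untouched
a_before = a.copy()

# The merged result only exists if we assign it to a name
merged = pd.merge(a, b, on="key", how="left")

print(a.equals(a_before))    # a still has only its original columns
print(list(merged.columns))  # the merged frame has columns from both
```

If you want to keep the result under the old name, reassign it yourself, e.g. `a = pd.merge(a, b, on="key")`.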

In [13]:
x = pd.DataFrame(
     { "Types":[1,2,3,4,5],
        "Info": ["a","b","c","d","e"]
     })
y = pd.DataFrame(
     { "Types":[1,3,5],
        "Stuff": ["m","n","o"]
     })

print(x)
print(y)

   Types Info
0      1    a
1      2    b
2      3    c
3      4    d
4      5    e
   Types Stuff
0      1     m
1      3     n
2      5     o


<h2>Inner Join</h2>

<p><span style="font-size: 14pt; color: #169179;"><strong>Inner Join</strong> =&nbsp; Merge only records whose key is in <strong>BOTH</strong> <strong>x</strong> and <strong>y</strong>.&nbsp; No columns need to be filled with missing data.&nbsp; Note this is the default.</span></p>

In [5]:
pd.merge(x,y,on="Types",how="inner")

Unnamed: 0,Types,Info,Stuff
0,1,a,m
1,3,c,n
2,5,e,o


<h2>Outer Join</h2>

<p><span style="font-size: 14pt; color: #169179;"><strong>Outer Join</strong> - <strong>Join</strong> <strong>all </strong>records from <strong>x </strong>with <strong>all </strong>records from <strong>y, </strong>so this is a <strong>union.</strong>&nbsp; <strong>Rows</strong> whose key <strong>exists</strong> in <strong>one dataframe</strong> but <strong>not the other</strong> get <strong>NaN</strong> in the columns that come from the <strong>other dataframe</strong>.&nbsp;</span></p>

In [12]:
pd.merge(x,y,on="Types",how="outer")

Unnamed: 0,Types,Info,Stuff
0,1,a,m
1,2,b,
2,3,c,n
3,4,d,
4,5,e,o
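
When reading an outer join it can help to see which side each row came from. One option is the `indicator=True` argument of `pd.merge()`, which adds a `_merge` column. The sketch below reuses `x` from above but adds a made-up extra row (`Types` 7) to `y`, so that all three possible labels appear.

```python
import pandas as pd

x = pd.DataFrame({"Types": [1, 2, 3, 4, 5], "Info": ["a", "b", "c", "d", "e"]})
# Assumption for illustration: an extra row (Types=7) that has no match in x
y = pd.DataFrame({"Types": [1, 3, 5, 7], "Stuff": ["m", "n", "o", "p"]})

# indicator=True adds a _merge column: "both", "left_only" or "right_only"
both = pd.merge(x, y, on="Types", how="outer", indicator=True)
print(both)

# Count how many rows fall into each category
counts = both["_merge"].astype(str).value_counts().to_dict()
print(counts)
```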


<h2>Left Join</h2>

<p><span style="font-size: 14pt; color: #169179;"><strong>Left Join</strong> - <strong>Keep all</strong> the records in <strong>x</strong> and <strong>merge</strong> in <strong>records</strong> in<strong> y</strong> that <strong>match</strong> the <strong>on column</strong>.&nbsp; Rows of x with no match in y get NaN in the columns that come from y.</span></p>

In [7]:
pd.merge(x,y,on="Types",how="left")

Unnamed: 0,Types,Info,Stuff
0,1,a,m
1,2,b,
2,3,c,n
3,4,d,
4,5,e,o


<h2>Right Join</h2>

<p><span style="font-size: 14pt; color: #169179;"><strong>Right Join</strong> - <strong>Keep all</strong> the records in <strong>y</strong> and <strong>merge</strong> in <strong>records</strong> in<strong> x</strong> that <strong>match</strong> the <strong>on column</strong>.&nbsp; Rows of y with no match in x get NaN in the columns that come from x.</span></p>

In [8]:
pd.merge(x,y,on="Types",how="right")

Unnamed: 0,Types,Info,Stuff
0,1,a,m
1,3,c,n
2,5,e,o


In [14]:
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

In [16]:
display(df1.head())
print("----------------------------")
print(df2.head())

Unnamed: 0,Courses,Fee,Duration
r1,Spark,20000,30days
r2,PySpark,25000,40days
r3,Python,22000,35days
r4,pandas,30000,50days


----------------------------
   Courses  Discount
r1   Spark      2000
r6    Java      2300
r3  Python      1200
r5      Go      2000


In [17]:
pd.merge(df1,df2,on="Courses")

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,20000,30days,2000
1,Python,22000,35days,1200


In [18]:
pd.merge(df1,df2,on="Courses",how="outer")

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,20000.0,30days,2000.0
1,PySpark,25000.0,40days,
2,Python,22000.0,35days,1200.0
3,pandas,30000.0,50days,
4,Java,,,2300.0
5,Go,,,2000.0


In [8]:
pd.merge(df1,df2,on="Courses",how="left").head(10)

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,20000,30days,2000.0
1,PySpark,25000,40days,
2,Python,22000,35days,1200.0
3,pandas,30000,50days,


In [9]:
pd.merge(df1,df2,on="Courses",how="right").head(10)

Unnamed: 0,Courses,Fee,Duration,Discount
0,Spark,20000.0,30days,2000
1,Java,,,2300
2,Python,22000.0,35days,1200
3,Go,,,2000
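
The merges above match on the `Courses` column and throw away the `r1`, `r2`, ... index labels. For comparison, here is a sketch of `.join()`, which matches on those index labels instead. The frames are rebuilt here (without `Duration`) so the block runs on its own, and the suffix names are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Courses": ["Spark", "PySpark", "Python", "pandas"],
     "Fee": [20000, 25000, 22000, 30000]},
    index=["r1", "r2", "r3", "r4"],
)
df2 = pd.DataFrame(
    {"Courses": ["Spark", "Java", "Python", "Go"],
     "Discount": [2000, 2300, 1200, 2000]},
    index=["r1", "r6", "r3", "r5"],
)

# join() lines rows up by index label, not by the Courses column.
# Both frames have a Courses column, so suffixes are needed to tell them apart.
joined = df1.join(df2, how="inner", lsuffix="_left", rsuffix="_right")
print(joined)
```

Only `r1` and `r3` appear in both indexes, so the inner join keeps just those two rows.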


In [11]:
! dir data

 Volume in drive C is Windows
 Volume Serial Number is 48FA-D8E0

 Directory of C:\Users\CoJoe\Documents\MiM\ETL\Pandas-Merging\data

11/14/2022  03:01 PM    <DIR>          .
11/14/2022  03:01 PM    <DIR>          ..
10/04/2022  10:39 AM               101 ds_chars.csv
10/04/2022  10:39 AM               136 states.csv
11/13/2022  08:49 PM             9,437 user_device.csv
11/13/2022  08:48 PM             6,432 user_usage.csv
               4 File(s)         16,106 bytes
               2 Dir(s)  976,138,821,632 bytes free


In [27]:
devices = pd.read_csv("data/user_device.csv")
devices.head()

Unnamed: 0,use_id,user_id,platform,platform_version,device,use_type_id
0,22782,26980,ios,10.2,"iPhone7,2",2
1,22783,29628,android,6.0,Nexus 5,3
2,22784,28473,android,5.1,SM-G903F,1
3,22785,15200,ios,10.2,"iPhone7,2",3
4,22786,28239,android,6.0,ONE E1003,1


In [14]:
usage = pd.read_csv("data/user_usage.csv")
usage.head()

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id
0,21.97,4.82,1557.33,22787
1,1710.08,136.88,7267.55,22788
2,1710.08,136.88,7267.55,22789
3,94.46,35.17,519.12,22790
4,71.59,79.26,1557.33,22792


In [22]:
df = pd.merge(devices,usage,on="use_id")
df.head()

Unnamed: 0,use_id,user_id,platform,platform_version,device,use_type_id,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb
0,22787,12921,android,4.3,GT-I9505,1,21.97,4.82,1557.33
1,22788,28714,android,6.0,SM-G930F,1,1710.08,136.88,7267.55
2,22789,28714,android,6.0,SM-G930F,1,1710.08,136.88,7267.55
3,22790,29592,android,5.1,D2303,1,94.46,35.17,519.12
4,22792,28217,android,5.1,SM-G361F,1,71.59,79.26,1557.33


In [24]:
df["platform"].value_counts()

android    157
ios          2
Name: platform, dtype: int64

In [26]:
df.groupby("platform").mean(numeric_only=True)

platform,use_id,user_id,platform_version,use_type_id,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb
android,22922.350318,25913.515924,5.501911,1.0,201.258535,85.354586,4221.387834
ios,22920.5,29682.0,9.7,2.0,366.06,293.975,961.155


In [28]:
df["device"].value_counts()

SM-G900F                  30
GT-I9505                  11
ONEPLUS A3003              7
SM-G920F                   5
SM-N910F                   5
HTC Desire 510             5
SM-J320FN                  5
SM-G361F                   5
SM-G935F                   5
SM-A300FU                  4
F3111                      4
Moto G (4)                 4
SM-G925F                   4
HTC Desire 825             3
SM-G930F                   3
GT-I9515                   3
GT-I9195                   3
HTC One mini 2             3
D2303                      2
GT-N7100                   2
D6603                      2
HTC Desire 626             2
ONE A2003                  2
X11                        2
A0001                      2
SM-G903F                   2
D5503                      2
GT-I9300                   2
HTC One S                  2
SM-G360F                   2
SM-A310F                   2
iPhone6,2                  1
C6603                      1
HUAWEI CUN-L01             1
GT-I9506      

In [30]:
androids = df.loc[ df["platform"] == "android"]

In [31]:
androids["platform"].value_counts()

android    157
Name: platform, dtype: int64

In [32]:
androids.groupby("device").mean(numeric_only=True)

device,use_id,user_id,platform_version,use_type_id,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb
A0001,22822.5,25635.0,6.0,1.0,170.395,62.1,15573.33
C6603,23028.0,29716.0,5.1,1.0,92.52,162.39,1557.33
D2303,22822.0,29592.0,5.1,1.0,96.845,35.375,519.12
D5503,22982.5,16673.0,5.1,1.0,146.45,48.67,1557.33
D5803,22832.0,29295.0,6.0,1.0,244.88,105.95,1557.33
D6603,22920.5,18833.0,6.0,1.0,362.01,14.19,7267.55
E6653,22833.0,24847.0,6.0,1.0,135.09,42.02,5191.12
EVA-L09,22931.0,29684.0,6.0,1.0,115.26,0.92,1557.33
F3111,22916.5,29666.0,6.0,1.0,46.2625,0.415,2076.45
GT-I8190N,22831.0,6541.0,4.1,1.0,85.97,26.94,407.01



In [20]:
ds = pd.read_csv("data/ds_chars.csv")
st = pd.read_csv("data/states.csv")

In [21]:
print(ds.head())
print(st.head())

   Unnamed: 0    name   HP home_state
0           0    greg  200         WA
1           1   miles  200         WA
2           2    alan  170         TX
3           3  alison  300         DC
4           4  rachel  200         TX
   Unnamed: 0 state   nickname     capital
0           0    WA  evergreen     Olympia
1           1    TX      alamo      Austin
2           2    DC   district  Washington
3           3    OH    buckeye    Columbus
4           4    OR     beaver       Salem


<p>pd.DataFrame.rename() - used to rename columns (or index labels) in a DataFrame.</p>
<p><a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html</a></p>

<p><span style="font-size: 14pt; color: #169179;"><strong>Quick Lesson</strong>:&nbsp; &nbsp;&nbsp;</span></p>
<p><span style="font-size: 14pt; color: #169179;">Why do we sometimes see <strong>pd.merge()</strong> for some methods/functions, but other times see <strong>pd.DataFrame.rename()</strong> for other methods/functions?&nbsp;</span></p>
<p>&nbsp;</p>
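
One way to read the notation: `pd.merge()` is a function that lives in the pandas namespace and takes the DataFrames as arguments, while `pd.DataFrame.rename()` names a method of the DataFrame class, which you call on a particular DataFrame. The sketch below uses small made-up frames to show both call styles.

```python
import pandas as pd

x = pd.DataFrame({"Types": [1, 2, 3], "Info": ["a", "b", "c"]})
y = pd.DataFrame({"Types": [1, 3], "Stuff": ["m", "n"]})

# Function form: both DataFrames are passed in as arguments
via_function = pd.merge(x, y, on="Types")

# Method form: called on the left DataFrame; the right one is the argument
via_method = x.merge(y, on="Types")

print(via_function.equals(via_method))

# There is no top-level pd.rename(); rename is called on a DataFrame
renamed = x.rename(columns={"Info": "Details"})
print(list(renamed.columns))
```

So merge happens to be available both ways, while rename is only written as `some_dataframe.rename(...)`.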

In [51]:
ds.rename(columns={"home_state":"state"},inplace=True)
print(ds.head())


   Unnamed: 0    name   HP state
0           0    greg  200    WA
1           1   miles  200    WA
2           2    alan  170    TX
3           3  alison  300    DC
4           4  rachel  200    TX
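
Renaming is one way to get a shared column name. An alternative sketch, assuming you would rather leave the original names alone, is to tell `pd.merge()` which column to use on each side with `left_on` and `right_on`. The frames below are small stand-ins built from a few of the values printed above, not the real CSV files.

```python
import pandas as pd

# Stand-in for ds_chars.csv, with the original home_state column name
ds_small = pd.DataFrame({
    "name": ["greg", "miles", "alan"],
    "HP": [200, 200, 170],
    "home_state": ["WA", "WA", "TX"],
})
# Stand-in for states.csv
st_small = pd.DataFrame({
    "state": ["WA", "TX", "DC"],
    "capital": ["Olympia", "Austin", "Washington"],
})

# Match ds_small.home_state against st_small.state without renaming anything
merged = pd.merge(ds_small, st_small,
                  left_on="home_state", right_on="state", how="inner")
print(merged)
```

The trade-off is that both key columns (`home_state` and `state`) end up in the result, whereas merging with `on="state"` after a rename keeps a single key column.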


In [50]:
st.head()

Unnamed: 0.1,Unnamed: 0,state,nickname,capital
0,0,WA,evergreen,Olympia
1,1,TX,alamo,Austin
2,2,DC,district,Washington
3,3,OH,buckeye,Columbus
4,4,OR,beaver,Salem


In [44]:
mg = pd.merge(ds,st,how="inner",on="state")

In [45]:
mg.head()

Unnamed: 0,Unnamed: 0_x,name,HP,state,Unnamed: 0_y,nickname,capital
0,0,greg,200,WA,0,evergreen,Olympia
1,1,miles,200,WA,0,evergreen,Olympia
2,2,alan,170,TX,1,alamo,Austin
3,4,rachel,200,TX,1,alamo,Austin
4,3,alison,300,DC,2,district,Washington


In [53]:
mg.groupby("name").mean(numeric_only=True)

name,Unnamed: 0_x,HP,Unnamed: 0_y
alan,2.0,170.0,1.0
alison,3.0,300.0,2.0
greg,0.0,200.0,0.0
miles,1.0,200.0,0.0
rachel,4.0,200.0,1.0


In [22]:
mov = pd.read_csv("data/movies_metadata.csv")


In [23]:
mov.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [24]:
links = pd.read_csv("data/links_small.csv")
ratings= pd.read_csv("data/ratings_small.csv")

In [25]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [26]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [30]:
print(links.shape)
print(ratings.shape)

(9125, 3)
(100004, 4)


In [38]:
pd.merge(links,ratings,on="movieId",how="left").head(100)

Unnamed: 0,movieId,imdbId,tmdbId,userId,rating,timestamp
0,1,114709,862.0,7.0,3.0,8.518667e+08
1,1,114709,862.0,9.0,4.0,9.386292e+08
2,1,114709,862.0,13.0,5.0,1.331380e+09
3,1,114709,862.0,15.0,2.0,9.979383e+08
4,1,114709,862.0,19.0,3.0,8.551901e+08
...,...,...,...,...,...,...
95,1,114709,862.0,272.0,4.0,1.453588e+09
96,1,114709,862.0,273.0,4.5,1.466946e+09
97,1,114709,862.0,275.0,5.0,1.350254e+09
98,1,114709,862.0,282.0,4.0,1.111494e+09
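
The left merge above repeats each `links` row once for every matching rating, which is why movieId 1 fills so many rows. Here is a sketch of that one-to-many behavior on hypothetical miniature versions of the two tables. The `validate` argument is optional; it asks pandas to raise an error if `movieId` turns out not to be unique on the left side.

```python
import pandas as pd

# Hypothetical miniature tables: three movies, four ratings
links_small = pd.DataFrame({"movieId": [1, 2, 3],
                            "tmdbId": [862.0, 8844.0, 15602.0]})
ratings_small = pd.DataFrame({"userId": [7, 9, 13, 7],
                              "movieId": [1, 1, 1, 2],
                              "rating": [3.0, 4.0, 5.0, 4.0]})

# One row in links_small can match many rows in ratings_small
merged = pd.merge(links_small, ratings_small, on="movieId",
                  how="left", validate="one_to_many")

print(merged.shape)
# Movie 3 has no ratings, so its userId and rating are NaN
print(merged)
```

Because the unmatched movie gets NaN, the integer `userId` column is converted to floats, the same effect that shows up as `7.0`, `9.0`, ... in the output above.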
