Readme:


For more functionality, see 'Python for Data Analysis, 3E' by Wes McKinney, Chapter 8: 'Data Wrangling: Join, Combine, and Reshape'.</br>
Link: https://wesmckinney.com/book/data-wrangling

In [2]:
import pandas as pd
import numpy as np

<p>
Merging.</br>
What is the default merge method in pandas? Run the code below and analyze the result.</br>
What is the default merge key column?</br>
</p>


In [3]:
df1 = pd.DataFrame({"key": ['a', "b", 'c'],
                     "data1": pd.Series(range(3), dtype="Int64")})

df2 = pd.DataFrame({"key": ["a", "b", "d"],
                     "data2": pd.Series(range(3, 6), dtype="Int64")})

pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,a,0,3
1,b,1,4
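<p>
One way to check the answers to the two questions above is the sketch below, which compares a bare merge call with an explicit inner merge on the shared column. The names left, right, default, explicit and audit are made up for this illustration, and indicator=True is an extra pd.merge option that the exercise itself does not use.
</p>

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "data1": [0, 1, 2]})
right = pd.DataFrame({"key": ["a", "b", "d"], "data2": [3, 4, 5]})

# With no extra arguments, merge joins on the column names the two
# frames have in common and keeps only keys present in both frames.
default = pd.merge(left, right)
explicit = pd.merge(left, right, on="key", how="inner")
print(default.equals(explicit))

# indicator=True adds a "_merge" column recording where each row came from,
# which helps when inspecting what an outer merge kept.
audit = pd.merge(left, right, how="outer", indicator=True)
print(audit[["key", "_merge"]])
```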


<p>
Perform a left merge on the given DataFrames, explicitly specifying the merge key column.</br>
</p>


In [4]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,key,data1,data2
0,a,0,3.0
1,b,1,4.0
2,c,2,


<p>
What if the merge keys have different names? Perform a right merge on the DataFrames below.</br>
</p>


In [6]:
df3 = pd.DataFrame({"key_l": ['a', "b", 'c'],
                     "data1": pd.Series(range(3), dtype="Int64")})

df4 = pd.DataFrame({"key_r": ["a", "b", "d"],
                     "data2": pd.Series(range(3, 6), dtype="Int64")})

pd.merge(df3, df4, left_on = 'key_l', right_on = 'key_r', how = 'right')

Unnamed: 0,key_l,data1,key_r,data2
0,a,0.0,a,3
1,b,1.0,b,4
2,,,d,5
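<p>
The result above carries both key columns, and 'key_l' is empty for rows that exist only on the right. A possible alternative, sketched here under the assumption that a single key column is wanted, is to rename one key before merging. The variable names are illustrative only.
</p>

```python
import pandas as pd

left = pd.DataFrame({"key_l": ["a", "b", "c"], "data1": [0, 1, 2]})
right = pd.DataFrame({"key_r": ["a", "b", "d"], "data2": [3, 4, 5]})

# Rename the right-hand key so both frames share one key column,
# then merge on that single name instead of left_on/right_on.
merged = pd.merge(left, right.rename(columns={"key_r": "key_l"}),
                  on="key_l", how="right")
print(merged)
```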


<p>
Perform an outer merge on multiple keys, 'key1' and 'key2', on the DataFrames below.</br>
</p>


In [8]:
df5 = pd.DataFrame({"key1": ["a", "b", "c"],
                    "key2": [1, 2, 3],
                    "data1": pd.Series(range(3), dtype="Int64")})

df6 = pd.DataFrame({"key1": ["a", "b", "d"],
                    "key2": [1, 4, 5],
                    "data2": pd.Series(range(3, 6), dtype="Int64")})

pd.merge(df5, df6, on = ['key1', 'key2'], how = 'outer')

Unnamed: 0,key1,key2,data1,data2
0,a,1,0.0,3.0
1,b,2,1.0,
2,c,3,2.0,
3,b,4,,4.0
4,d,5,,5.0
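<p>
When merging on several keys, a row matches only if every key agrees, so duplicated key pairs can silently multiply rows. The sketch below is not part of the original exercise: it uses the validate argument of pd.merge to assert that each key pair is unique on both sides. The frame dup_right is a hypothetical input built only to trigger the check.
</p>

```python
import pandas as pd

left = pd.DataFrame({"key1": ["a", "b", "c"], "key2": [1, 2, 3], "data1": [0, 1, 2]})
right = pd.DataFrame({"key1": ["a", "b", "d"], "key2": [1, 4, 5], "data2": [3, 4, 5]})

# validate="one_to_one" raises if a (key1, key2) pair repeats in either frame.
outer = pd.merge(left, right, on=["key1", "key2"], how="outer", validate="one_to_one")

# Hypothetical bad input: repeat the first row so ("a", 1) appears twice.
dup_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup_right, on=["key1", "key2"], validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(outer), duplicate_detected)
```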


<p>
Now outer-merge df5 and df6 on column 'key1' only and see how the overlapping column name 'key2' is displayed.</br>
</p>


In [9]:
pd.merge(df5, df6, on = 'key1', how = 'outer')

Unnamed: 0,key1,key2_x,data1,key2_y,data2
0,a,1.0,0.0,1.0,3.0
1,b,2.0,1.0,4.0,4.0
2,c,3.0,2.0,,
3,d,,,5.0,5.0


<p>
Now repeat the same merge, but provide custom suffixes for the overlapping column name 'key2'.</br>
</p>


In [10]:
pd.merge(df5, df6, on = 'key1', suffixes = ['_left', '_right'], how = 'outer') 

Unnamed: 0,key1,key2_left,data1,key2_right,data2
0,a,1.0,0.0,1.0,3.0
1,b,2.0,1.0,4.0,4.0
2,c,3.0,2.0,,
3,d,,,5.0,5.0
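<p>
Suffixes are applied only to overlapping columns that are not merge keys. As a small variation, assuming you want the left column to keep its original name, one of the two suffixes can be an empty string. The frame names here are illustrative.
</p>

```python
import pandas as pd

left = pd.DataFrame({"key1": ["a", "b", "c"], "key2": [1, 2, 3], "data1": [0, 1, 2]})
right = pd.DataFrame({"key1": ["a", "b", "d"], "key2": [1, 4, 5], "data2": [3, 4, 5]})

# An empty left suffix leaves 'key2' from the left frame unchanged;
# only the right-hand copy is renamed to 'key2_right'.
merged = pd.merge(left, right, on="key1", how="outer", suffixes=("", "_right"))
print(list(merged.columns))
```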


<p>
Merging on Index.</br>
In some cases, the merge key(s) in a DataFrame will be found in its index (row labels).</br>
Can you perform an inner merge of the DataFrames below on df1's column 'key' and df2's index?</br></br>

Note: DataFrame has a 'join' instance method that simplifies merging by index. Explore the differences between 'merge' and 'join' in pandas on your own.
</p>


In [14]:
df1 = pd.DataFrame({"key": ['a', "b", 'c'],
                     "value": pd.Series(range(3), dtype="Int64")})

df2 = pd.DataFrame({'group_val': [2.5, 3.5]}, index = ['a', 'b'])
print(df1)
print(df2)

print(pd.merge(df1, df2, left_on = 'key', right_index = True))

  key  value
0   a      0
1   b      1
2   c      2
   group_val
a        2.5
b        3.5
  key  value  group_val
0   a      0        2.5
1   b      1        3.5
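<p>
As a starting point for comparing 'merge' and 'join', the sketch below repeats the merge above with DataFrame.join. It assumes the same shapes as df1 and df2 but uses new illustrative names. Note that join defaults to a left join, while merge defaults to an inner join.
</p>

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "value": [0, 1, 2]})
right = pd.DataFrame({"group_val": [2.5, 3.5]}, index=["a", "b"])

# merge: match the left column 'key' against the right frame's index.
via_merge = pd.merge(left, right, left_on="key", right_index=True)

# join: 'on' names the left column; the right frame is always matched by index.
via_join = left.join(right, on="key", how="inner")

# Without how="inner", join keeps every left row and fills gaps with NaN.
left_join = left.join(right, on="key")

print(via_merge.equals(via_join))
print(left_join)
```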


<p>
Concatenating Along an Axis. </br>
Run the code below and analyze the result of pandas.concat().</br>

</p>


In [15]:
s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64

<p>
Now concatenate the Series along the column axis and analyze the result.</br>
</p>


In [None]:
pd.concat([s1,s2,s3], axis=1) 

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0
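<p>
In the result above no index labels overlap, so every row is mostly empty. The sketch below, which assumes a hypothetical Series s4 that shares labels with s1, shows how join="inner" keeps only the shared labels and how 'keys' names the resulting columns when concatenating along axis=1.
</p>

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
# s4 is an illustrative Series that overlaps s1 on labels 'a' and 'b'.
s4 = pd.concat([s1, pd.Series([5, 6], index=["f", "g"])])

# Default (outer) behaviour: the row index is the union of all labels.
outer = pd.concat([s1, s4], axis=1)

# join="inner" keeps only labels present in every input;
# with axis=1 the keys become the column names.
inner = pd.concat([s1, s4], axis=1, join="inner", keys=["one", "two"])

print(outer)
print(inner)
```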


<p>
A potential issue with the previous code is that the concatenated pieces are not identifiable in the result.</br>
Suppose instead you wanted to create a hierarchical index on the concatenation axis.</br>
To do this, use the 'keys' argument: run the code below and analyze the result.</br>
Think about how you would use this functionality in real life.</br>


</p>


In [18]:
result = pd.concat([s1, s2, s3], keys=["one", "two", "three"])
print(result)
print(result.unstack())

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: Int64
          a     b     c     d     e     f     g
one       0     1  <NA>  <NA>  <NA>  <NA>  <NA>
two    <NA>  <NA>     2     3     4  <NA>  <NA>
three  <NA>  <NA>  <NA>  <NA>  <NA>     5     6
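<p>
One possible real-life use, sketched with made-up data: stacking per-month tables that share the same columns, while keeping track of which month each row came from. The frames jan and feb, the column names, and the level names 'month' and 'row' are all assumptions for this example.
</p>

```python
import pandas as pd

# Hypothetical monthly extracts with identical columns.
jan = pd.DataFrame({"store": ["north", "south"], "sales": [10, 20]})
feb = pd.DataFrame({"store": ["north", "south"], "sales": [12, 21]})

# keys labels each piece; names gives the two index levels readable names.
combined = pd.concat([jan, feb], keys=["jan", "feb"], names=["month", "row"])

# Select one piece back out by its key, or aggregate per piece.
feb_only = combined.loc["feb"]
totals = combined.groupby(level="month")["sales"].sum()

# Turn the key into an ordinary column if a flat table is preferred.
flat = combined.reset_index(level="month")

print(combined)
print(totals)
```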


<p>
Combining Data with Overlap. </br>
Use the numpy.where() function to produce an output array in which NA values in Series 'a' are replaced with values from Series 'b', without checking whether the index labels are aligned.</br>
</p>


In [19]:
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
               index=["f", "e", "d", "c", "b", "a"])

b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
               index=["a", "b", "c", "d", "e", "f"])

np.where(pd.isna(a), b, a)  # returns a NumPy array; values are paired by position, not by index label

array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
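<p>
Because numpy.where() pairs values by position, the array above fills label 'f' of 'a' with the value that 'b' stores under label 'a'. The sketch below contrasts that with a label-aware variant that first reorders 'b' to match 'a' via reindex. The names positional, b_aligned and aligned are illustrative.
</p>

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])
b = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.0],
              index=["a", "b", "c", "d", "e", "f"])

# Positional: element i of b fills element i of a, whatever the labels say.
positional = np.where(pd.isna(a), b, a)

# Label-aware: reorder b to a's labels first, then apply the same rule.
b_aligned = b.reindex(a.index)
aligned = pd.Series(np.where(a.isna(), b_aligned, a), index=a.index)

print(positional)
print(aligned)
```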

<p>
What if you want to line up the values by index?</br>
Use the Series.combine_first() method and analyze the result.</br>

</p>


In [20]:
a.combine_first(b)  # returns a Series; values are matched by index label

a    0.0
b    4.5
c    3.5
d    0.0
e    2.5
f    5.0
dtype: float64

<p>
Now run DataFrame.combine_first() on the DataFrames below and analyze the result, noting how this method lines up the values by index.</br>
</p>


In [None]:
df1 = pd.DataFrame({"a": [1., np.nan, 5., np.nan],
                     "b": [np.nan, 2., np.nan, 6.],
                     "c": range(2, 18, 4)})

df2 = pd.DataFrame({"a": [5., 4., np.nan, 3., 7.],
                     "b": [np.nan, 3., 4., 6., 8.]})

# With DataFrame objects, the output of combine_first has the union of all rows.
print(df1.combine_first(df2))

     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
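<p>
combine_first can be read as "patching": missing values in the calling object are filled from the argument, and labels found only in the argument are added. The sketch below uses string row labels instead of the default integer index to make the alignment easier to see. The frames primary and backup and their columns are invented for this illustration.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical primary source with a gap, and a fuller backup source.
primary = pd.DataFrame({"price": [10.0, np.nan]}, index=["x", "y"])
backup = pd.DataFrame({"price": [11.0, 21.0, 31.0], "stock": [1, 2, 3]},
                      index=["x", "y", "z"])

# Values in primary win where present; gaps, the extra row 'z'
# and the extra column 'stock' are taken from backup.
patched = primary.combine_first(backup)
print(patched)
```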


<p>
Reshaping and Pivoting.</br>
Pivoting “Long” to “Wide” Format.</br>
A common way to store multiple time series in databases and CSV files is what is sometimes called long or stacked format.</br>
In this format, individual values are represented by a single row in a table rather than multiple values per row.</br>
</p>
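<p>
To make the long format described above concrete, the sketch below builds a tiny long table and reshapes it to wide format with DataFrame.pivot, then back again with melt. The column names 'date', 'item' and 'value' and the data are assumptions chosen for illustration, not a task from the book.
</p>

```python
import pandas as pd

# Long format: one row per (date, item) observation.
long = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "item": ["cpi", "gdp", "cpi", "gdp"],
    "value": [1.0, 2.0, 1.5, 2.5],
})

# Wide format: one row per date, one column per item.
wide = long.pivot(index="date", columns="item", values="value")

# melt reverses the operation, returning to one row per observation.
back = wide.reset_index().melt(id_vars="date", var_name="item", value_name="value")

print(wide)
print(back)
```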


