## Data Wrangling: Clean, Transform, Merge, Reshape

## Combining and merging data sets

### Database-style DataFrame merges

In [4]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

In [2]:
df1 = pd.DataFrame({'data1' : range(7), 'key' : list('bbacaab')})
df2 = pd.DataFrame({'data2' : range(20,23), 'key' : list('abd')})

In [8]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [9]:
df2

Unnamed: 0,data2,key
0,20,a
1,21,b
2,22,d


By default, .merge() performs an [inner join](https://www.w3schools.com/sql/sql_join.asp) between the DataFrames, using the common columns as keys.

In [17]:
df1.merge(df2) 

Unnamed: 0,data1,key,data2
0,0,b,21
1,1,b,21
2,6,b,21
3,2,a,20
4,4,a,20
5,5,a,20


When keys are repeated, the merge returns the Cartesian product of the rows that share each key, so duplicates produce every possible combination:

In [11]:
df2_wdups = pd.DataFrame({'data2wdups': range(20, 24), 'key': list('abda')})
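As a rough illustration of how duplicated keys multiply rows, the sketch below rebuilds both frames under the hypothetical names `left` and `right` (with a lowercase `key` column on each side, which is an assumption) and counts the merged rows.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2_wdups
left = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
# 'a' appears twice on the right-hand side
right = pd.DataFrame({'data2wdups': range(20, 24), 'key': list('abda')})

merged = left.merge(right)  # inner join on the shared 'key' column

# left has 3 'a' rows and 3 'b' rows; right has 2 'a' rows and 1 'b' row,
# so the result has 3*2 + 3*1 = 9 rows
print(len(merged))
print(merged.sort_values(['key', 'data1', 'data2wdups']))
```

Each of the three `a` rows on the left is paired with both `a` rows on the right, which is why row counts can grow quickly when keys are not unique.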

If the columns to join on don't have the same name, or we want to join on the index of the DataFrames, we'll need to specify that.

In [11]:
df3 = pd.DataFrame({'data1' : range(7), 'lkey' : list('bbacaab')})
df4 = pd.DataFrame({'data2' : range(3), 'rkey' : list('abd')})

In [12]:
df3

Unnamed: 0,data1,lkey
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [14]:
df4

Unnamed: 0,data2,rkey
0,0,a
1,1,b
2,2,d


In [16]:
df3.merge(df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1,b
1,1,b,1,b
2,6,b,1,b
3,2,a,0,a
4,4,a,0,a
5,5,a,0,a


In [19]:
df3.merge(df4, left_on='lkey', right_on='rkey', how='outer')  # left_on/right_on are needed because the key columns have different names

Unnamed: 0,data1,lkey,data2,rkey
0,0.0,b,1.0,b
1,1.0,b,1.0,b
2,6.0,b,1.0,b
3,2.0,a,0.0,a
4,4.0,a,0.0,a
5,5.0,a,0.0,a
6,3.0,c,,
7,,,2.0,d


In [22]:
df3.merge(df4, left_on='lkey', right_on='rkey', how='left')  # keep every row of the left DataFrame

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,1.0,b
1,1,b,1.0,b
2,2,a,0.0,a
3,3,c,,
4,4,a,0.0,a
5,5,a,0.0,a
6,6,b,1.0,b


In [26]:
df3.merge(df4, left_on=['lkey', 'data1'], right_on=['rkey', 'data2'])  # match rows where both pairs of columns are equal

Unnamed: 0,data1,lkey,data2,rkey
0,1,b,1,b


If there are two columns with the same name that we do not join on, both will get transferred to the resulting DataFrame with a suffix. We can customize these suffixes.

In [27]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [28]:
df2

Unnamed: 0,data2,key
0,20,a
1,21,b
2,22,d


In [29]:
df2['XXX'] = 17
df2

Unnamed: 0,data2,key,XXX
0,20,a,17
1,21,b,17
2,22,d,17


In [30]:
df1['XXX'] = 42
df1

Unnamed: 0,data1,key,XXX
0,0,b,42
1,1,b,42
2,2,a,42
3,3,c,42
4,4,a,42
5,5,a,42
6,6,b,42


In [None]:
df1.merge(df2, on='key')  # left_on alone raises an error; on='key' joins on the shared column
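A minimal sketch of the suffix behaviour described above, using the hypothetical names `left` and `right` for frames shaped like `df1` and `df2` after the `XXX` column has been added. The `_left` and `_right` suffixes are arbitrary choices for illustration.

```python
import pandas as pd

left = pd.DataFrame({'data1': range(7), 'key': list('bbacaab'), 'XXX': 42})
right = pd.DataFrame({'data2': range(20, 23), 'key': list('abd'), 'XXX': 17})

# Default suffixes: the overlapping non-key column becomes XXX_x and XXX_y
default = left.merge(right, on='key')
print(list(default.columns))   # ['data1', 'key', 'XXX_x', 'data2', 'XXX_y']

# Custom suffixes via the suffixes parameter
custom = left.merge(right, on='key', suffixes=('_left', '_right'))
print(list(custom.columns))    # ['data1', 'key', 'XXX_left', 'data2', 'XXX_right']
```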

### Merging on index

In [18]:
df5 = pd.DataFrame({'g': range(4), 'h': range(8, 12)}, index=list('abcd'))
df5

Unnamed: 0,g,h
a,0,8
b,1,9
c,2,10
d,3,11


In [25]:
df1.merge(df5, left_on='key', right_index=True)

Unnamed: 0,data1,key,g,h
0,0,b,1,9
1,1,b,1,9
6,6,b,1,9
2,2,a,0,8
4,4,a,0,8
5,5,a,0,8
3,3,c,2,10


### Concatenating along an axis

In [36]:
s1 = pd.Series(range(4), index=list('abcd'))
s2 = pd.Series(range(10,13), index=list('lmn'))
s3 = pd.Series(range(40,43), index=list('xyz'))
s1,s2,s3

(a    0
 b    1
 c    2
 d    3
 dtype: int64, l    10
 m    11
 n    12
 dtype: int64, x    40
 y    41
 z    42
 dtype: int64)

In [39]:
pd.concat([s1, s2, s3])

a     0
b     1
c     2
d     3
l    10
m    11
n    12
x    40
y    41
z    42
dtype: int64

In [38]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,2.0,,
d,3.0,,
l,,10.0,
m,,11.0,
n,,12.0,
x,,,40.0
y,,,41.0
z,,,42.0


In [37]:
pd.concat([df1,df5])

Unnamed: 0,XXX,data1,g,h,key
0,42.0,0.0,,,b
1,42.0,1.0,,,b
2,42.0,2.0,,,a
3,42.0,3.0,,,c
4,42.0,4.0,,,a
5,42.0,5.0,,,a
6,42.0,6.0,,,b
a,,,0.0,8.0,
b,,,1.0,9.0,
c,,,2.0,10.0,


#### Digression

Attention! Be careful not to assign to the names of built-ins or of functions you have imported: the assignment silently replaces the original object under that name.

For a built-in, deleting the shadowing variable makes the built-in visible again. For an object or function that you overwrote on a module, deleting it does not bring back the original value: you'll need to reload the module with importlib.reload(), since Python does not re-execute an already imported module when you import it again. importlib.reload() is also useful when you are actively developing your own module and want to load the latest definition of a function into memory.
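The sketch below illustrates the module case under one assumption: it uses the standard-library `statistics` module as a stand-in for whichever module was overwritten, because it is pure Python and reloads cleanly.

```python
import importlib
import statistics

# Accidentally overwrite a function on the module
statistics.mean = lambda data: 'overwritten'
print(statistics.mean([1, 2, 3]))      # overwritten

# Deleting the attribute removes it; the original does not come back
del statistics.mean
print(hasattr(statistics, 'mean'))     # False

# A plain "import statistics" would reuse the cached module object;
# importlib.reload re-executes the module's source and redefines mean
statistics = importlib.reload(statistics)
restored = statistics.mean([1, 2, 3])
print(restored)
```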

## Data transformation

### Removing duplicates

In [34]:
df6 = pd.DataFrame({'key1' : ['one'] * 3 + ['two'] * 4,
                    'key2' : [1, 1, 2, 3, 3, 4, 4]})
df6

Unnamed: 0,key1,key2
0,one,1
1,one,1
2,one,2
3,two,3
4,two,3
5,two,4
6,two,4


In [44]:
df6.duplicated(subset='key1', keep='last')

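As a self-contained sketch of what this call is expected to return, the frame is rebuilt below and `duplicated` is shown alongside `drop_duplicates`. Using `drop_duplicates` with all columns is an addition for illustration, not something the cell above runs.

```python
import pandas as pd

df6 = pd.DataFrame({'key1': ['one'] * 3 + ['two'] * 4,
                    'key2': [1, 1, 2, 3, 3, 4, 4]})

# keep='last': every occurrence of a key1 value except the last is flagged
mask = df6.duplicated(subset='key1', keep='last')
print(mask.tolist())   # [True, True, False, True, True, True, False]

# Considering all columns, keep the first occurrence of each distinct row
deduped = df6.drop_duplicates()
print(deduped)
```

Filtering with `df6[~mask]` would keep only the last row for each value of `key1`.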

### Renaming axis indexes

### Discretization and binning

In [51]:
ages = pd.Series([18,20,23,34,44,51])
bins = [0,18,25,35,100]
cuts = pd.cut(ages,bins)
cuts


0      (0, 18]
1     (18, 25]
2     (18, 25]
3     (25, 35]
4    (35, 100]
5    (35, 100]
dtype: category
Categories (4, interval[int64]): [(0, 18] < (18, 25] < (25, 35] < (35, 100]]
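The intervals above are closed on the right, so 18 falls in `(0, 18]`. A brief sketch of two common variations follows. The label strings are hypothetical and chosen only to mirror the bin edges.

```python
import pandas as pd

ages = pd.Series([18, 20, 23, 34, 44, 51])
bins = [0, 18, 25, 35, 100]

# Hypothetical labels, one per interval
labels = ['0-18', '19-25', '26-35', '36-100']
named = pd.cut(ages, bins, labels=labels)
print(named.tolist())   # ['0-18', '19-25', '19-25', '26-35', '36-100', '36-100']

# right=False makes intervals closed on the left: [0, 18), [18, 25), ...
left_closed = pd.cut(ages, bins, right=False)
print(left_closed.value_counts(sort=False))
```

With `right=False`, the age 18 moves from the first bin to the second.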

## String manipulation

### String object methods

In [29]:
my_string = 'tocoto fue a beber'
type(my_string.split())

list

In [32]:
pp = animals = 'rhino giraffe molerat mantisshrimp cheetah mosquito whale'.split()
type(pp)

list

In [34]:
list(map(lambda x: x.capitalize(), animals))

['Rhino', 'Giraffe', 'Molerat', 'Mantisshrimp', 'Cheetah', 'Mosquito', 'Whale']

In [52]:
animals = ['giraffle', 'patp', 'mariposa', 'leon']
[animal[:1] for animal in animals]  # a distinct loop variable avoids shadowing the list

['g', 'p', 'm', 'l']

In [53]:
animals_series = pd.Series(animals)
animals_series

0    giraffle
1        patp
2    mariposa
3        leon
dtype: object

In [54]:
animals_series + '.........casa'

0    giraffle.........casa
1        patp.........casa
2    mariposa.........casa
3        leon.........casa
dtype: object

In [55]:
animals_series.str.contains('g')

0     True
1    False
2    False
3    False
dtype: bool

In [57]:
animals_series.str[:1]

0    g
1    p
2    m
3    l
dtype: object

### Vectorized string functions in pandas

[Vectorized string functions in pandas](https://pandas.pydata.org/pandas-docs/stable/text.html) are grouped within the .str attribute of Series and Indexes. They have the same names as the regular Python string functions, but work on Series of strings.
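A few more `.str` methods, sketched on the same animal names used earlier. The `with_missing` Series is a hypothetical addition to show how missing values are handled.

```python
import pandas as pd

animals_series = pd.Series(['giraffle', 'patp', 'mariposa', 'leon'])

upper = animals_series.str.upper()
lengths = animals_series.str.len()
replaced = animals_series.str.replace('a', 'A', regex=False)

print(upper.tolist())      # ['GIRAFFLE', 'PATP', 'MARIPOSA', 'LEON']
print(lengths.tolist())    # [8, 4, 8, 4]
print(replaced.tolist())   # ['girAffle', 'pAtp', 'mAriposA', 'leon']

# Missing values propagate through .str methods instead of raising an error
with_missing = pd.Series(['rhino', None, 'whale'])
capitalized = with_missing.str.capitalize()
print(capitalized.tolist())
```

Calling `str.capitalize` directly inside `map` on `with_missing` would fail on the missing entry, which is one practical reason to prefer the `.str` accessor.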