## Data Wrangling: Clean, Transform, Merge, Reshape

In [10]:
import pandas as pd
import numpy as np

## Combining and merging data sets

### Database-style DataFrame merges

In [12]:
df1 = pd.DataFrame({'data1':range(5,12),'key':list('bbacaab')})
df2 = pd.DataFrame({'data2':range(56,59),'key':list('abd')})
df1

,data1,key
0,5,b
1,6,b
2,7,a
3,8,c
4,9,a
5,10,a
6,11,b


In [6]:
df2

,data2,key
0,56,a
1,57,b
2,58,d


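By default, .merge() performs an [inner join](https://www.w3schools.com/sql/sql_join.asp) between the DataFrames, using the columns they have in common as keys. The how parameter selects a different join type: how='left' below keeps every row of df1 and fills data2 with NaN where df2 has no matching key (here, 'c').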

In [14]:
df1.merge(df2,how='left')

,data1,key,data2
0,5,b,57.0
1,6,b,57.0
2,7,a,56.0
3,8,c,
4,9,a,56.0
5,10,a,56.0
6,11,b,57.0
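
To make the difference between join types concrete, here is a minimal sketch that rebuilds df1 and df2 as defined above (so the block runs on its own) and counts the rows each value of how returns. The names row_counts, outer and origin_counts are illustrative only. indicator=True adds a `_merge` column that records which side each row came from.

```python
import pandas as pd

# Same frames as in the cells above, rebuilt so the block runs on its own.
df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})

# Number of rows returned by each join type.
row_counts = {how: len(df1.merge(df2, how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = df1.merge(df2, how='outer', indicator=True)
origin_counts = outer['_merge'].value_counts().to_dict()
print(origin_counts)
```

The inner join drops both 'c' (only in df1) and 'd' (only in df2); the left join keeps 'c', the right join keeps 'd', and the outer join keeps both.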


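A merge returns the Cartesian product of the rows that share a key: if a key is duplicated on both sides, the result contains every possible combination. The example below uses the default inner join, so keys present in only one DataFrame ('c' and 'd') are dropped: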

In [20]:
df3 = pd.DataFrame({'data2':range(56,61),'key':list('abdbd')})
df1.merge(df3)

,data1,key,data2
0,5,b,57
1,5,b,59
2,6,b,57
3,6,b,59
4,11,b,57
5,11,b,59
6,7,a,56
7,9,a,56
8,10,a,56
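
When such a row multiplication is not intended, the validate parameter of .merge() can act as a guard. The sketch below rebuilds df1 and df3 as above; the variable right_keys_unique is a hypothetical name used only for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df3 = pd.DataFrame({'data2': range(56, 61), 'key': list('abdbd')})

# 'b': 3 rows in df1 x 2 rows in df3 = 6; 'a': 3 x 1 = 3; total 9 rows.
merged = df1.merge(df3)
print(len(merged))

# validate='many_to_one' checks that the keys are unique in the right frame
# and raises MergeError otherwise ('b' and 'd' are duplicated in df3).
try:
    df1.merge(df3, validate='many_to_one')
    right_keys_unique = True
except MergeError:
    right_keys_unique = False
print(right_keys_unique)
```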


If the columns to join on don't have the same name, or we want to join on the index of a DataFrame, we need to say so explicitly with left_on / right_on or left_index / right_index.

In [23]:
df4 = pd.DataFrame({'data2':range(56,61),'rkey':list('abdbd')})
df1.merge(df4,left_on='key',right_on='rkey')

,data1,key,data2,rkey
0,5,b,57,b
1,5,b,59,b
2,6,b,57,b
3,6,b,59,b
4,11,b,57,b
5,11,b,59,b
6,7,a,56,a
7,9,a,56,a
8,10,a,56,a
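
With left_on / right_on both key columns survive in the result, even though they hold the same values. Two possible ways to end up with a single key column are sketched here, using df1 and df4 as defined above; merged and renamed are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df4 = pd.DataFrame({'data2': range(56, 61), 'rkey': list('abdbd')})

# Option 1: merge on differently named columns, then drop the redundant key.
merged = df1.merge(df4, left_on='key', right_on='rkey').drop(columns='rkey')

# Option 2: rename the right-hand key first and merge on the shared name.
renamed = df1.merge(df4.rename(columns={'rkey': 'key'}), on='key')

print(merged.columns.tolist())
print(renamed.columns.tolist())
```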


If both DataFrames have a column with the same name that we do not join on, both columns are carried over to the result, each with a suffix that identifies its source. We can customize these suffixes with the suffixes parameter.

In [26]:
df1.merge(df2,left_on='data1',right_on='data2', how='outer', suffixes=['_customer','_order'])

,data1,key_customer,data2,key_order
0,5.0,b,,
1,6.0,b,,
2,7.0,a,,
3,8.0,c,,
4,9.0,a,,
5,10.0,a,,
6,11.0,b,,
7,,,56.0,a
8,,,57.0,b
9,,,58.0,d
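
For comparison, a short sketch of what the column names look like without custom suffixes: pandas falls back to `_x` for the left frame and `_y` for the right one. The frames are rebuilt as above; default and custom are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})

# 'key' exists in both frames but is not used as the join key here.
default = df1.merge(df2, left_on='data1', right_on='data2', how='outer')
custom = df1.merge(df2, left_on='data1', right_on='data2', how='outer',
                   suffixes=('_customer', '_order'))

print(default.columns.tolist())
print(custom.columns.tolist())
```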


### Merging on index

In [29]:
df4.index = range(5,10)
df4

,data2,rkey
5,56,a
6,57,b
7,58,d
8,59,b
9,60,d


In [33]:
df1

,data1,key
0,5,b
1,6,b
2,7,a
3,8,c
4,9,a
5,10,a
6,11,b


In [32]:
df1.merge(df4,left_on='data1',right_index=True)

,data1,key,data2,rkey
0,5,b,56,a
1,6,b,57,b
2,7,a,58,d
3,8,c,59,b
4,9,a,60,d
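
DataFrame.join() is a shorthand for this kind of merge: it matches a column (or the index) of the calling frame against the index of the other frame. The sketch below rebuilds df1 and df4 with the index set above; joined and lookup are illustrative names. Note that .join() defaults to a left join, so how='inner' is passed to mirror the merge above.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df4 = pd.DataFrame({'data2': range(56, 61), 'rkey': list('abdbd')},
                   index=range(5, 10))

# Match df1['data1'] against the index of df4, keeping only matching rows.
joined = df1.join(df4, on='data1', how='inner')
print(joined)

# Map each data1 value to the rkey it was matched with.
lookup = dict(zip(joined['data1'], joined['rkey']))
```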


### Concatenating along an axis

In [34]:
pd.concat([df1,df2],axis=1)

,data1,key,data2,key
0,5,b,56.0,a
1,6,b,57.0,b
2,7,a,58.0,d
3,8,c,,
4,9,a,,
5,10,a,,
6,11,b,,


In [35]:
pd.concat([df1,df4],axis=1)

,data1,key,data2,rkey
0,5.0,b,,
1,6.0,b,,
2,7.0,a,,
3,8.0,c,,
4,9.0,a,,
5,10.0,a,56.0,a
6,11.0,b,57.0,b
7,,,58.0,d
8,,,59.0,b
9,,,60.0,d


In [39]:
pd.concat([df1['data1'],df2['data2'],df3['data2']], axis=1)

,data1,data2,data2
0,5,56.0,56.0
1,6,57.0,57.0
2,7,58.0,58.0
3,8,,59.0
4,9,,60.0
5,10,,
6,11,,
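
The cells above all concatenate side by side (axis=1), aligning rows on the index and filling the gaps with NaN. Two variations are sketched here under the same definitions of df1, df2 and df4: stacking rows with the default axis=0, and keeping only the index labels shared by both frames with join='inner'. The names stacked and shared are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})
df4 = pd.DataFrame({'data2': range(56, 61), 'rkey': list('abdbd')},
                   index=range(5, 10))

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked.shape)

# join='inner' keeps only the index labels present in every frame (5 and 6).
shared = pd.concat([df1, df4], axis=1, join='inner')
print(shared)
```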


#### Digression

Attention! Be careful not to assign to built-in names or to the functions of a module: you will overwrite the original object.
If that happens, for example because an assignment clobbered one of the module's functions, you may need to reload the module.
e.g.: pd.merge = 1

In [48]:
from importlib import reload
reload(pd)

<module 'pandas' from '/usr/local/lib/python3.7/site-packages/pandas/__init__.py'>

You can delete the overwritten name with del, but for an attribute of a module that does not bring back the original value. If it was an object or function from a module, you'll need to reload() the module, because Python does not load an already imported module again when you import it a second time. reload() is also useful when you are actively developing your own module and want to load the latest definition of a function into memory.
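
The behaviour can be reproduced safely with a small pure-Python module. The sketch below uses the standard library's json module as a stand-in for pandas (an assumption made only to keep the example light); still_clobbered and restored are illustrative names.

```python
import json
from importlib import reload

json.dumps = 1      # accidentally overwrite a function of the module

import json         # no effect: the module is already cached in sys.modules
still_clobbered = json.dumps == 1

reload(json)        # re-executes the module's code, rebinding json.dumps
restored = callable(json.dumps)

print(still_clobbered, restored)
print(json.dumps([1, 2]))
```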

## Data transformation

### Removing duplicates

In [51]:
df1['key'].duplicated()

0    False
1     True
2    False
3    False
4     True
5     True
6     True
Name: key, dtype: bool

In [52]:
df1['key'].drop_duplicates()

0    b
2    a
3    c
Name: key, dtype: object

In [58]:
df1.drop_duplicates(subset='key', keep='first')  # keeps whole rows but compares only the 'key' column; keep='first' or 'last' picks which duplicate survives

,data1,key
0,5,b
2,7,a
3,8,c
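
The other values of keep behave as sketched below, with df1 rebuilt as above; last_rows and unique_only are illustrative names. keep='last' retains the last occurrence of each key, and keep=False drops every row whose key appears more than once.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})

# Keep the last row seen for each key: 'c' at 3, 'a' at 5, 'b' at 6.
last_rows = df1.drop_duplicates(subset='key', keep='last')
print(last_rows)

# Drop all rows whose key is repeated; only 'c' occurs exactly once.
unique_only = df1.drop_duplicates(subset='key', keep=False)
print(unique_only)
```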


### Renaming axis indexes

In [64]:
df1.index = list('lolilos')
df1

,data1,key
l,5,b
o,6,b
l,7,a
i,8,c
l,9,a
o,10,a
s,11,b
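
Assigning to .index replaces all the labels at once and modifies df1 in place. To change only some labels, or to get a modified copy, .rename() accepts a mapping or a function per axis. A minimal sketch, with df1 rebuilt using the index set above and renamed as an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')},
                   index=list('lolilos'))

# Map selected index labels with a dict and transform every column name
# with a function; rename() returns a new DataFrame by default.
renamed = df1.rename(index={'l': 'L'}, columns=str.upper)
print(renamed)

# The original frame keeps its labels.
print(df1.index.tolist())
```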


### Discretization and binning

In [67]:
import numpy as np
np.random.seed(42)
ages = pd.Series(np.random.randint(9,99,50))

In [68]:
limits = [14,18,35,50,65]

In [72]:
categorized = pd.cut(ages,limits, include_lowest=True)
categorized

0       (50.0, 65.0]
1       (18.0, 35.0]
2                NaN
3                NaN
4       (18.0, 35.0]
5                NaN
6                NaN
7                NaN
8                NaN
9                NaN
10      (18.0, 35.0]
11               NaN
12      (18.0, 35.0]
13      (50.0, 65.0]
14               NaN
15               NaN
16      (35.0, 50.0]
17      (35.0, 50.0]
18               NaN
19               NaN
20               NaN
21      (18.0, 35.0]
22      (35.0, 50.0]
23               NaN
24               NaN
25      (18.0, 35.0]
26               NaN
27      (50.0, 65.0]
28               NaN
29      (35.0, 50.0]
30               NaN
31               NaN
32      (18.0, 35.0]
33               NaN
34               NaN
35      (50.0, 65.0]
36               NaN
37      (50.0, 65.0]
38      (50.0, 65.0]
39               NaN
40               NaN
41      (50.0, 65.0]
42    (13.999, 18.0]
43      (18.0, 35.0]
44               NaN
45      (35.0, 50.0]
46      (18.0, 35.0]
47           

In [75]:
pd.cut(ages,limits).value_counts()

(18, 35]    9
(50, 65]    7
(35, 50]    5
(14, 18]    1
dtype: int64
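
The NaN entries above are ages that fall outside the outermost limits, and the first bin shows up as (13.999, 18.0] because include_lowest=True nudges the left edge so that 14 itself is included. The sketch below illustrates both points with a few hand-picked ages instead of the random ones; the label names are hypothetical.

```python
import pandas as pd

ages = pd.Series([14, 16, 18, 19, 35, 36, 70])  # hand-picked, not the random ages
limits = [14, 18, 35, 50, 65]

# Bins are open on the left and closed on the right: (14, 18], (18, 35], ...
# so 14 and 70 fall outside every bin and become NaN.
default = pd.cut(ages, limits)
print(default.isna().tolist())

# include_lowest=True makes the first bin include its left edge (14).
with_lowest = pd.cut(ages, limits, include_lowest=True)
print(with_lowest.isna().tolist())

# labels replaces the interval notation; one label per bin (hypothetical names).
labels = ['teen', 'young adult', 'adult', 'senior']
labeled = pd.cut(ages, limits, labels=labels, include_lowest=True)
print(labeled.tolist())
```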

## String manipulation

### String object methods

In [84]:
bichos = pd.Series(np.random.choice(['Mantis Shrimp', 'Naked Mole Rat','Star Noded Mole'],15))
bichos.str.upper()
bichos.str.lower()
bichos.str.len()
bichos.str.split()

0     [Star, Noded, Mole]
1        [Mantis, Shrimp]
2     [Star, Noded, Mole]
3     [Star, Noded, Mole]
4      [Naked, Mole, Rat]
5      [Naked, Mole, Rat]
6        [Mantis, Shrimp]
7     [Star, Noded, Mole]
8     [Star, Noded, Mole]
9     [Star, Noded, Mole]
10       [Mantis, Shrimp]
11       [Mantis, Shrimp]
12     [Naked, Mole, Rat]
13       [Mantis, Shrimp]
14    [Star, Noded, Mole]
dtype: object

In [89]:
bichos.str.split().str[1]

0      Noded
1     Shrimp
2      Noded
3      Noded
4       Mole
5       Mole
6     Shrimp
7      Noded
8      Noded
9      Noded
10    Shrimp
11    Shrimp
12      Mole
13    Shrimp
14     Noded
dtype: object

### Vectorized string functions in pandas

[Vectorized string functions in pandas](https://pandas.pydata.org/pandas-docs/stable/text.html) are grouped within the .str attribute of Series and Indexes. Most have the same names as the built-in Python string methods, but they operate element-wise on a whole Series of strings and propagate missing values instead of raising an error.
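
A short sketch of how the .str accessor deals with missing values, using a small hand-written Series rather than the random bichos above; the names animals, upper, has_mole and lengths are illustrative.

```python
import pandas as pd

animals = pd.Series(['Mantis Shrimp', 'Naked Mole Rat', None, 'Star Noded Mole'])

# A plain list comprehension such as [a.upper() for a in animals] would fail
# on the missing entry; the .str methods pass it through as a missing value.
upper = animals.str.upper()
print(upper)

# str.contains has no direct equivalent among the built-in str methods;
# na=False treats missing entries as "no match".
has_mole = animals.str.contains('Mole', na=False)
print(has_mole.tolist())

# Lengths of the non-missing strings.
lengths = animals.dropna().str.len()
print(lengths.tolist())
```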