## Data Wrangling: Clean, Transform, Merge, Reshape

In [8]:
import pandas as pd

## Combining and merging data sets

### Database-style DataFrame merges

In [10]:
df1 = pd.DataFrame({'data1' : range(5,12), 'key' : list('bbcaaba')})
df2 = pd.DataFrame({'data2' : range(56,59), 'key' : list('abd')})
df1

Unnamed: 0,data1,key
0,5,b
1,6,b
2,7,c
3,8,a
4,9,a
5,10,b
6,11,a


In [11]:
df2

Unnamed: 0,data2,key
0,56,a
1,57,b
2,58,d


By default, .merge() performs an [inner join](https://www.w3schools.com/sql/sql_join.asp) between the DataFrames, using the common columns as keys.

In [12]:
pd.merge(df1,df2)

Unnamed: 0,data1,key,data2
0,5,b,57
1,6,b,57
2,10,b,57
3,8,a,56
4,9,a,56
5,11,a,56


In [13]:
df1.merge(df2)

Unnamed: 0,data1,key,data2
0,5,b,57
1,6,b,57
2,10,b,57
3,8,a,56
4,9,a,56
5,11,a,56


In [14]:
df1.merge(df2,how='outer')

Unnamed: 0,data1,key,data2
0,5.0,b,57.0
1,6.0,b,57.0
2,10.0,b,57.0
3,7.0,c,
4,8.0,a,56.0
5,9.0,a,56.0
6,11.0,a,56.0
7,,d,58.0


A merge returns the Cartesian product of the rows that share a key: if a key is duplicated in both DataFrames, the result contains every possible combination of the matching rows:

In [21]:
df3 = pd.DataFrame({'data2' : range(56,61), 'key' : list('abdbd')})
print(df1)
print(df3)

   data1 key
0      5   b
1      6   b
2      7   c
3      8   a
4      9   a
5     10   b
6     11   a
   data2 key
0     56   a
1     57   b
2     58   d
3     59   b
4     60   d


In [23]:
df1.merge(df3)

Unnamed: 0,data1,key,data2
0,5,b,57
1,5,b,59
2,6,b,57
3,6,b,59
4,10,b,57
5,10,b,59
6,8,a,56
7,9,a,56
8,11,a,56
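Because keys duplicated on both sides multiply rows, it can help to predict the row count and to state the relationship you expect. The sketch below rebuilds df1 and df3 from the cells above and uses the validate argument of merge, which raises MergeError when the keys do not match the declared relationship. The names expected_rows and error_raised are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbcaaba')})
df3 = pd.DataFrame({'data2': range(56, 61), 'key': list('abdbd')})

# Rows per key in the inner join = (count on the left) * (count on the right).
left_counts = df1['key'].value_counts()
right_counts = df3['key'].value_counts()
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = df1.merge(df3)
print(len(merged), expected_rows)

# 'key' repeats in df3, so declaring a many-to-one relationship should fail.
try:
    df1.merge(df3, validate='many_to_one')
    error_raised = False
except pd.errors.MergeError:
    error_raised = True
print(error_raised)
```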


If the columns to join on don't have the same name, or we want to join on the index of the DataFrames, we'll need to specify that.

In [24]:
df4 = pd.DataFrame({'data2' : range(56,61), 'rkey' : list('abdbd')})

In [25]:
df4

Unnamed: 0,data2,rkey
0,56,a
1,57,b
2,58,d
3,59,b
4,60,d


In [26]:
df1.merge(df4)

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

If both DataFrames have a column with the same name that is not used as a join key, both columns are carried over to the result with a suffix (_x and _y by default). We can customize these suffixes with the suffixes argument.

In [28]:
df1.merge(df4, left_on='key', right_on='rkey')

Unnamed: 0,data1,key,data2,rkey
0,5,b,57,b
1,5,b,59,b
2,6,b,57,b
3,6,b,59,b
4,10,b,57,b
5,10,b,59,b
6,8,a,56,a
7,9,a,56,a
8,11,a,56,a
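With left_on and right_on, both key columns are kept, even though in an inner join they hold the same values. A minimal sketch of dropping the redundant one, with df1 and df4 rebuilt from the cells above:

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbcaaba')})
df4 = pd.DataFrame({'data2': range(56, 61), 'rkey': list('abdbd')})

merged = df1.merge(df4, left_on='key', right_on='rkey')

# In an inner join the two key columns are identical, so one of them can go.
same_values = bool((merged['key'] == merged['rkey']).all())
merged = merged.drop(columns='rkey')
print(same_values, merged.columns.tolist())
```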


In [29]:
df1.merge(df2, left_on='data1', right_on='data2', how='outer')

Unnamed: 0,data1,key_x,data2,key_y
0,5.0,b,,
1,6.0,b,,
2,7.0,c,,
3,8.0,a,,
4,9.0,a,,
5,10.0,b,,
6,11.0,a,,
7,,,56.0,a
8,,,57.0,b
9,,,58.0,d


In [30]:
df1.merge(df2)

Unnamed: 0,data1,key,data2
0,5,b,57
1,6,b,57
2,10,b,57
3,8,a,56
4,9,a,56
5,11,a,56


In [31]:
df1.merge(df2, left_on='data1', right_on='data2', how='outer', suffixes=('_customer','_order'))

Unnamed: 0,data1,key_customer,data2,key_order
0,5.0,b,,
1,6.0,b,,
2,7.0,c,,
3,8.0,a,,
4,9.0,a,,
5,10.0,b,,
6,11.0,a,,
7,,,56.0,a
8,,,57.0,b
9,,,58.0,d


### Merging on index

In [32]:
df4

Unnamed: 0,data2,rkey
0,56,a
1,57,b
2,58,d
3,59,b
4,60,d


In [33]:
df4.index = range(5,10)
df4

Unnamed: 0,data2,rkey
5,56,a
6,57,b
7,58,d
8,59,b
9,60,d


In [34]:
df1

Unnamed: 0,data1,key
0,5,b
1,6,b
2,7,c
3,8,a
4,9,a
5,10,b
6,11,a


In [35]:
df1.merge(df4, left_on='data1', right_index=True)

Unnamed: 0,data1,key,data2,rkey
0,5,b,56,a
1,6,b,57,b
2,7,c,58,d
3,8,a,59,b
4,9,a,60,d
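DataFrame.join covers the same case with less typing: it matches against the index of the other DataFrame, and its on argument names the column of the calling DataFrame to look up there. The sketch below uses the same data as above. Note that join defaults to a left join, so how='inner' is passed here to reproduce the merge result.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbcaaba')})
df4 = pd.DataFrame({'data2': range(56, 61), 'rkey': list('abdbd')},
                   index=range(5, 10))

# Column 'data1' of df1 is matched against the index of df4 in both calls.
via_merge = df1.merge(df4, left_on='data1', right_index=True)
via_join = df1.join(df4, on='data1', how='inner')

print(via_join)
```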


### Concatenating along an axis

In [42]:
df_concat = pd.concat([df1, df2], sort=True)

In [43]:
df_concat.loc[0]

Unnamed: 0,data1,data2,key
0,5.0,,b
0,,56.0,a


In [45]:
df_concat.reset_index()

Unnamed: 0,index,data1,data2,key
0,0,5.0,,b
1,1,6.0,,b
2,2,7.0,,c
3,3,8.0,,a
4,4,9.0,,a
5,5,10.0,,b
6,6,11.0,,a
7,0,,56.0,a
8,1,,57.0,b
9,2,,58.0,d
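The repeated labels appear because concat keeps each input's own index. Two options that avoid them, shown as a sketch: ignore_index=True builds a fresh 0..n-1 index, and keys= adds an outer index level that records where each row came from. The labels 'left' and 'right' are arbitrary names chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbcaaba')})
df2 = pd.DataFrame({'data2': range(56, 59), 'key': list('abd')})

# Option 1: discard the original labels and number the rows again.
flat = pd.concat([df1, df2], sort=True, ignore_index=True)
print(flat.index.is_unique)

# Option 2: keep the labels, but prefix them with the source of each row.
tagged = pd.concat([df1, df2], sort=True, keys=['left', 'right'])
print(tagged.loc['right'])
```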


#### Digression

Attention! Be careful not to assign to the name of a built-in or of a library function: the assignment replaces the original object under that name.

If we clobber a pandas method this way, we can either restart the kernel or call reload from importlib.

In [46]:
from importlib import reload
reload(pd)

<module 'pandas' from '/home/matozqui/anaconda3/lib/python3.7/site-packages/pandas/__init__.py'>

You can delete the overwritten variable, but that does not bring back an object or function that belonged to a module. In that case you need to reload() the module, because importing an already imported module again does not re-execute it. reload() is also useful while you are actively developing your own module and want to load the latest definition of a function into memory.
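A small sketch of that recovery path. It uses the standard library's json module rather than pandas, purely so the example stays light; the mechanism is the same for any module written in Python.

```python
import json
from importlib import reload

# Accidentally overwrite a function that lives in the module.
json.dumps = None
print(callable(json.dumps))   # False

# A plain import returns the cached module, so the damage persists.
import json
print(callable(json.dumps))   # False

# reload() re-executes the module's code and defines the function again.
reload(json)
print(callable(json.dumps))   # True
```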

## Data transformation

### Removing duplicates

In [47]:
df1['key']

0    b
1    b
2    c
3    a
4    a
5    b
6    a
Name: key, dtype: object

The first occurrence of each value is not flagged as a duplicate; the later ones are.

In [49]:
df1['key'].duplicated()

0    False
1     True
2    False
3    False
4     True
5     True
6     True
Name: key, dtype: bool

In [50]:
df1['key'].drop_duplicates()

0    b
2    c
3    a
Name: key, dtype: object

In [52]:
df1.drop_duplicates(subset='key')

Unnamed: 0,data1,key
0,5,b
2,7,c
3,8,a


In [56]:
df1['key2'] = list('aaabbcc')

In [60]:
df1

Unnamed: 0,data1,key,key2
0,5,b,a
1,6,b,a
2,7,c,a
3,8,a,b
4,9,a,b
5,10,b,c
6,11,a,c


In [59]:
df1[['key','key2']].drop_duplicates()

Unnamed: 0,key,key2
0,b,a
2,c,a
3,a,b
5,b,c
6,a,c


In [62]:
df1.drop_duplicates(subset=['key','key2'])

Unnamed: 0,data1,key,key2
0,5,b,a
2,7,c,a
3,8,a,b
5,10,b,c
6,11,a,c


In [66]:
df1.drop_duplicates(subset=['key','key2'], keep='last')

Unnamed: 0,data1,key,key2
1,6,b,a
2,7,c,a
4,9,a,b
5,10,b,c
6,11,a,c


In [67]:
df1.drop_duplicates(subset=['key','key2'], keep=False)

Unnamed: 0,data1,key,key2
2,7,c,a
5,10,b,c
6,11,a,c
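duplicated() and drop_duplicates() are two views of the same check: the first returns a boolean mask, the second returns the rows the mask leaves unflagged. A minimal sketch with the same data, assuming the default keep='first':

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12),
                    'key': list('bbcaaba'),
                    'key2': list('aaabbcc')})

# True for rows that repeat an earlier (key, key2) pair.
mask = df1.duplicated(subset=['key', 'key2'])

# Keeping the unflagged rows matches drop_duplicates with keep='first'.
by_mask = df1[~mask]
by_method = df1.drop_duplicates(subset=['key', 'key2'])
print(by_mask.equals(by_method))
```

The mask form is handy when you want to inspect the duplicated rows (df1[mask]) before deciding to drop them.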


### Renaming axis indexes

In [69]:
df1.index = list('jklllds')

In [70]:
df1

Unnamed: 0,data1,key,key2
j,5,b,a
k,6,b,a
l,7,c,a
l,8,a,b
l,9,a,b
d,10,b,c
s,11,a,c
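Assigning to .index replaces every label at once. To change only some labels, or to derive new labels from the old ones, rename accepts a mapping or a function and returns a new DataFrame. A small sketch; the replacement labels are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbcaaba')},
                   index=list('jklllds'))

# A mapping touches only the listed labels: every 'l' becomes 'L'.
renamed = df1.rename(index={'l': 'L'}, columns={'data1': 'value'})
print(renamed.index.tolist())

# A function is applied to every label.
upper = df1.rename(index=str.upper)
print(upper.index.tolist())
```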


### Discretization and binning

In [71]:
import numpy as np

In [74]:
np.random.seed(42)
ages = pd.Series(np.random.randint(9,99, 50))
ages

0     60
1     23
2     80
3     69
4     29
5     91
6     95
7     83
8     83
9     96
10    32
11    11
12    30
13    61
14    10
15    96
16    38
17    46
18    10
19    72
20    68
21    29
22    41
23    84
24    66
25    30
26    97
27    57
28    67
29    50
30    68
31    88
32    23
33    70
34    70
35    55
36    70
37    59
38    63
39    72
40    11
41    59
42    15
43    29
44    81
45    47
46    26
47    12
48    97
49    68
dtype: int64

In [77]:
limits = [0, 18, 30, 45, 65, 85, 100]
pd.cut(ages, limits)

0      (45, 65]
1      (18, 30]
2      (65, 85]
3      (65, 85]
4      (18, 30]
5     (85, 100]
6     (85, 100]
7      (65, 85]
8      (65, 85]
9     (85, 100]
10     (30, 45]
11      (0, 18]
12     (18, 30]
13     (45, 65]
14      (0, 18]
15    (85, 100]
16     (30, 45]
17     (45, 65]
18      (0, 18]
19     (65, 85]
20     (65, 85]
21     (18, 30]
22     (30, 45]
23     (65, 85]
24     (65, 85]
25     (18, 30]
26    (85, 100]
27     (45, 65]
28     (65, 85]
29     (45, 65]
30     (65, 85]
31    (85, 100]
32     (18, 30]
33     (65, 85]
34     (65, 85]
35     (45, 65]
36     (65, 85]
37     (45, 65]
38     (45, 65]
39     (65, 85]
40      (0, 18]
41     (45, 65]
42      (0, 18]
43     (18, 30]
44     (65, 85]
45     (45, 65]
46     (18, 30]
47      (0, 18]
48    (85, 100]
49     (65, 85]
dtype: category
Categories (6, interval[int64]): [(0, 18] < (18, 30] < (30, 45] < (45, 65] < (65, 85] < (85, 100]]
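The intervals above are open on the left and closed on the right, so an age of exactly 18 falls in (0, 18]. The sketch below shows two common variations: labels= gives the bins readable names, and right=False closes them on the left instead. The group names are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
ages = pd.Series(np.random.randint(9, 99, 50))

limits = [0, 18, 30, 45, 65, 85, 100]
# Hypothetical group names, one per bin.
names = ['minor', 'young', 'adult', 'middle', 'senior', 'elderly']

groups = pd.cut(ages, limits, labels=names)
print(groups.value_counts(sort=False))

# right=False makes the bins closed on the left: [0, 18), [18, 30), ...
left_closed = pd.cut(ages, limits, right=False)
print(left_closed.cat.categories[0])
```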

## String manipulation

### String object methods

In [84]:
cosmos_df = pd.Series(np.random.choice(['Pluto','Jupyter','Mars','Earth the true planet','Milky Way','ISS'], 60))

In [85]:
cosmos_df.upper()

AttributeError: 'Series' object has no attribute 'upper'

In [86]:
cosmos_df.str.upper()

0                       ISS
1     EARTH THE TRUE PLANET
2                      MARS
3                      MARS
4                     PLUTO
5                       ISS
6                 MILKY WAY
7     EARTH THE TRUE PLANET
8                   JUPYTER
9                       ISS
10                      ISS
11                     MARS
12                    PLUTO
13                    PLUTO
14    EARTH THE TRUE PLANET
15                     MARS
16                      ISS
17                MILKY WAY
18                     MARS
19    EARTH THE TRUE PLANET
20    EARTH THE TRUE PLANET
21                     MARS
22    EARTH THE TRUE PLANET
23                     MARS
24                  JUPYTER
25                     MARS
26                     MARS
27    EARTH THE TRUE PLANET
28    EARTH THE TRUE PLANET
29                    PLUTO
30                    PLUTO
31                  JUPYTER
32                    PLUTO
33                     MARS
34    EARTH THE TRUE PLANET
35                  

### Vectorized string functions in pandas

[Vectorized string functions in pandas](https://pandas.pydata.org/pandas-docs/stable/text.html) are grouped within the .str attribute of Series and Indexes. They have the same names as the regular Python string functions, but work on Series of strings.
Vectorized means that the function is applied to each element of the object.
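A short sketch of what that per-element behaviour buys, on a made-up Series that includes a missing value. The plain Python loop has to guard against the non-string entry, while the .str accessor passes it through as NaN.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Pluto', 'Mars', np.nan, 'Milky Way'])

# Element by element in plain Python: the missing value is handled by hand.
manual = [s.upper() if isinstance(s, str) else s for s in names]

# The .str accessor does the same per-element work and keeps NaN as NaN.
vectorized = names.str.upper()
print(vectorized)
```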

In [87]:
cosmos_df.str.lower()

0                       iss
1     earth the true planet
2                      mars
3                      mars
4                     pluto
5                       iss
6                 milky way
7     earth the true planet
8                   jupyter
9                       iss
10                      iss
11                     mars
12                    pluto
13                    pluto
14    earth the true planet
15                     mars
16                      iss
17                milky way
18                     mars
19    earth the true planet
20    earth the true planet
21                     mars
22    earth the true planet
23                     mars
24                  jupyter
25                     mars
26                     mars
27    earth the true planet
28    earth the true planet
29                    pluto
30                    pluto
31                  jupyter
32                    pluto
33                     mars
34    earth the true planet
35                  

In [88]:
cosmos_df.str.len()

0      3
1     21
2      4
3      4
4      5
5      3
6      9
7     21
8      7
9      3
10     3
11     4
12     5
13     5
14    21
15     4
16     3
17     9
18     4
19    21
20    21
21     4
22    21
23     4
24     7
25     4
26     4
27    21
28    21
29     5
30     5
31     7
32     5
33     4
34    21
35     5
36     5
37     7
38     3
39     7
40     4
41    21
42     7
43     5
44    21
45    21
46     5
47     7
48     5
49     3
50    21
51     9
52     9
53     4
54     5
55     5
56     4
57     4
58     4
59    21
dtype: int64

In [89]:
cosmos_df.str.split()

0                          [ISS]
1     [Earth, the, true, planet]
2                         [Mars]
3                         [Mars]
4                        [Pluto]
5                          [ISS]
6                   [Milky, Way]
7     [Earth, the, true, planet]
8                      [Jupyter]
9                          [ISS]
10                         [ISS]
11                        [Mars]
12                       [Pluto]
13                       [Pluto]
14    [Earth, the, true, planet]
15                        [Mars]
16                         [ISS]
17                  [Milky, Way]
18                        [Mars]
19    [Earth, the, true, planet]
20    [Earth, the, true, planet]
21                        [Mars]
22    [Earth, the, true, planet]
23                        [Mars]
24                     [Jupyter]
25                        [Mars]
26                        [Mars]
27    [Earth, the, true, planet]
28    [Earth, the, true, planet]
29                       [Pluto]
30        

In [90]:
cosmos_df.str[:3]

0     ISS
1     Ear
2     Mar
3     Mar
4     Plu
5     ISS
6     Mil
7     Ear
8     Jup
9     ISS
10    ISS
11    Mar
12    Plu
13    Plu
14    Ear
15    Mar
16    ISS
17    Mil
18    Mar
19    Ear
20    Ear
21    Mar
22    Ear
23    Mar
24    Jup
25    Mar
26    Mar
27    Ear
28    Ear
29    Plu
30    Plu
31    Jup
32    Plu
33    Mar
34    Ear
35    Plu
36    Plu
37    Jup
38    ISS
39    Jup
40    Mar
41    Ear
42    Jup
43    Plu
44    Ear
45    Ear
46    Plu
47    Jup
48    Plu
49    ISS
50    Ear
51    Mil
52    Mil
53    Mar
54    Plu
55    Plu
56    Mar
57    Mar
58    Mar
59    Ear
dtype: object

In [94]:
df_bikes_craches = pd.read_csv('./AccidentesBicicletas_2019.csv', encoding='latin1', sep=';')

In [95]:
df_bikes_craches

Unnamed: 0,Nº EXPEDIENTE,FECHA,HORA,CALLE,NÚMERO,DISTRITO,TIPO ACCIDENTE,ESTADO METEREOLÓGICO,TIPO VEHÍCULO,TIPO PERSONA,RANGO DE EDAD,SEXO,LESIVIDAD*
0,2019S000659,01/01/2019,14:00,CALL. CASTELLO / CALL. DON RAMON DE LA CRUZ,-,SALAMANCA,Alcance,Despejado,Bicicleta,Conductor,DE 25 A 29 AÑOS,Hombre,1.0
1,2019S000036,02/01/2019,20:45,AVDA. GRAN VIA DE HORTALEZA / GTA. LUIS ROSALES,-,HORTALEZA,Colisión fronto-lateral,Despejado,Bicicleta,Conductor,DE 70 A 74 AÑOS,Hombre,3.0
2,2019S000133,03/01/2019,14:30,CALL. FELIPE ALVAREZ,10,VILLA DE VALLECAS,Alcance,Se desconoce,Bicicleta,Conductor,DE 15 A 17 AÑOS,Hombre,7.0
3,2019S000132,03/01/2019,12:45,AVDA. SANTA EUGENIA / CALL. REAL DE ARGANDA,-,VILLA DE VALLECAS,Alcance,Despejado,Bicicleta,Conductor,DE 18 A 20 AÑOS,Hombre,7.0
4,2019S000132,03/01/2019,12:45,AVDA. SANTA EUGENIA / CALL. REAL DE ARGANDA,-,VILLA DE VALLECAS,Alcance,Despejado,Bicicleta,Conductor,DE 21 A 24 AÑOS,Hombre,14.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
774,2019S034457,29/10/2019,20:10,CALL. ARTURO SORIA,334A,CIUDAD LINEAL,Caída,Despejado,Bicicleta,Conductor,DE 40 A 44 AÑOS,Hombre,7.0
775,2019S034336,29/10/2019,11:00,CALL. SAN CONRADO / AVDA. MANZANARES,-,LATINA,Colisión fronto-lateral,Despejado,Bicicleta,Conductor,DE 25 A 29 AÑOS,Hombre,
776,2019S034331,30/10/2019,10:00,RONDA. SEGOVIA / CALL. SEGOVIA,-,CENTRO,Alcance,Despejado,Bicicleta,Conductor,DE 50 A 54 AÑOS,Mujer,7.0
777,2019S034555,31/10/2019,21:05,PASEO. PRADO,40,CENTRO,Alcance,Despejado,Bicicleta,Conductor,DE 21 A 24 AÑOS,Hombre,2.0


We work with a Series here because Index objects are immutable.
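A minimal sketch of that immutability, using two of the column names above. Item assignment on an Index raises TypeError, while a Series built from it can be edited in place.

```python
import pandas as pd

columns = pd.Index(['FECHA', 'TIPO ACCIDENTE'])

# Item assignment on an Index is not allowed.
try:
    columns[0] = 'fecha'
    assigned = True
except TypeError:
    assigned = False
print(assigned)

# A Series built from the Index can be modified element by element.
as_series = pd.Series(columns)
as_series[0] = 'fecha'
print(as_series.tolist())
```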

In [99]:
head_bikes = pd.Series(df_bikes_craches.columns)

In [101]:
head_bikes

0           Nº  EXPEDIENTE
1                    FECHA
2                     HORA
3                    CALLE
4                   NÚMERO
5                 DISTRITO
6           TIPO ACCIDENTE
7     ESTADO METEREOLÓGICO
8            TIPO VEHÍCULO
9             TIPO PERSONA
10           RANGO DE EDAD
11                    SEXO
12              LESIVIDAD*
dtype: object

In [114]:
(head_bikes.str.replace(" ","")).str.lower()

0            nºexpediente
1                   fecha
2                    hora
3                   calle
4                  número
5                distrito
6           tipoaccidente
7     estadometereológico
8            tipovehículo
9             tipopersona
10            rangodeedad
11                   sexo
12             lesividad*
dtype: object

In [105]:
head_bikes.str.isalnum()

0     False
1      True
2      True
3      True
4      True
5      True
6     False
7     False
8     False
9     False
10    False
11     True
12    False
dtype: bool

In [137]:
head_bikes.str.replace(r"[^\w-]", "", regex=True).str.lower()

0            nºexpediente
1                   fecha
2                    hora
3                   calle
4                  número
5                distrito
6           tipoaccidente
7     estadometereológico
8            tipovehículo
9             tipopersona
10            rangodeedad
11                   sexo
12              lesividad
dtype: object
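Whether str.replace treats its pattern as a regular expression depends on the regex argument, and the default has changed between pandas versions, so passing it explicitly is the safer habit. A sketch on two of the headers above:

```python
import pandas as pd

head = pd.Series(['TIPO ACCIDENTE', 'LESIVIDAD*'])

# regex=True: the pattern is a regular expression.
# [^\w-] matches any character that is not a word character or a hyphen.
cleaned = head.str.replace(r'[^\w-]', '', regex=True).str.lower()
print(cleaned.tolist())

# regex=False: the pattern is literal text, so '*' needs no escaping.
no_star = head.str.replace('*', '', regex=False)
print(no_star.tolist())
```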

In [164]:
index_series = (head_bikes.str.split().str[-1]            # last word of each header
                .str.replace('*', '', regex=False)        # drop the literal asterisk
                .str.lower()
                .str.replace('ú', 'u').str.replace('ó', 'o').str.replace('í', 'i'))

In [165]:
df_bikes_craches.columns = index_series
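Replacing accented letters one by one is easy to get incomplete. As an alternative sketch (not the approach used in the cells above), Unicode normalization can strip all accents in one pass. The toy DataFrame below reuses three of the real column names with made-up values.

```python
import pandas as pd

# Toy frame with three of the real column names; the values are made up.
df = pd.DataFrame([[1, 'Bicicleta', 2.0]],
                  columns=['NÚMERO', 'TIPO VEHÍCULO', 'LESIVIDAD*'])

clean = (pd.Series(df.columns)
         .str.split().str[-1]                    # keep the last word
         .str.replace('*', '', regex=False)      # drop the literal asterisk
         .str.lower()
         .str.normalize('NFKD')                  # split letters from accents
         .str.encode('ascii', errors='ignore')   # discard the accent marks
         .str.decode('ascii'))

# Column names belong to the DataFrame, so that is where they are assigned.
df.columns = clean
print(df.columns.tolist())
```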

In [None]:
head_bike