# Intro to Pandas (Panel Data)

**[Documentation](https://pandas.pydata.org/docs/reference/index.html#api)**

**[Source code](https://github.com/pandas-dev/pandas)**


![pandas](../../images/pandas.png)


Pandas is a Python library specialized in manipulating and analyzing structured data.


The main features of this library are:

+ It defines new data structures based on NumPy arrays, with additional functionality.
+ It makes it easy to read and write CSV and Excel files, as well as SQL databases.
+ It allows data to be accessed through indexes or names for rows and columns.
+ It offers methods for reordering, splitting, and combining datasets.
+ It supports working with time series.
+ It performs all these operations very efficiently.


**Pandas data types**

Pandas provides two main data structures:

+ Series: a one-dimensional structure.
+ DataFrame: a two-dimensional structure (tables).

These structures are built on top of NumPy arrays, adding new functionality.
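
As a minimal sketch of that relationship (the variable names are illustrative and not part of the original notebook), a Series or a DataFrame can be built from a NumPy array and converted back with `to_numpy()`:

```python
import numpy as np
import pandas as pd

# A one-dimensional NumPy array becomes a Series
arr_1d = np.array([1.5, 2.5, 3.5])
s = pd.Series(arr_1d)

# A two-dimensional NumPy array becomes a DataFrame
arr_2d = np.arange(6).reshape(3, 2)
frame = pd.DataFrame(arr_2d, columns=['a', 'b'])

# Both structures can hand their data back as NumPy arrays
print(type(s.to_numpy()))
print(frame.to_numpy().shape)  # (3, 2)
```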

In [1]:
!pip install pandas



In [2]:
import pandas as pd
import numpy as np

In [3]:
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages so they are not printed

### Series

Series are structures similar to one-dimensional arrays. They are homogeneous, that is, all their elements must be of the same type, and their size is immutable, that is, it cannot be changed, although their contents can.

A Series has an index that associates a name with each element, and the element is accessed through that name.
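
The following sketch illustrates what homogeneity means in practice. It assumes small example lists that are not in the original notebook. When the input mixes types, pandas picks a single dtype able to hold every element:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])             # all integers: an integer dtype
mixed_numeric = pd.Series([1, 2.5, 3])  # ints and a float: upcast to float64
mixed_types = pd.Series([1, 'dos', 3.0])  # numbers and text: generic object dtype

print(ints.dtype)
print(mixed_numeric.dtype)  # float64
print(mixed_types.dtype)    # object
```

The `object` dtype stores arbitrary Python objects, so the Series still has a single dtype, but most of the speed advantages of NumPy are lost.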

In [4]:
lst = [(3.4 + i) ** 2 for i in range(10)]

lst

[11.559999999999999,
 19.360000000000003,
 29.160000000000004,
 40.96000000000001,
 54.760000000000005,
 70.56,
 88.36000000000001,
 108.16000000000001,
 129.96,
 153.76000000000002]

In [5]:
type(lst)

list

In [6]:
serie = pd.Series(lst)

serie

0     11.56
1     19.36
2     29.16
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [7]:
type(serie)

pandas.core.series.Series

In [9]:
serie.head(2) # shows the first 2 elements of the series (head() returns 5 by default)

0    11.56
1    19.36
dtype: float64

In [10]:
serie.tail(2) # shows the last 2 elements of the series (tail() returns 5 by default)

8    129.96
9    153.76
dtype: float64

In [11]:
serie.index

RangeIndex(start=0, stop=10, step=1)

**Renaming the index of a Series**

In [12]:
serie.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

serie

a     11.56
b     19.36
c     29.16
d     40.96
e     54.76
f     70.56
g     88.36
h    108.16
i    129.96
j    153.76
dtype: float64

In [13]:
serie.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
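
Once the index holds labels, elements can be retrieved by label or by position. The sketch below uses a shortened, hypothetical three-element series rather than the `serie` object above:

```python
import pandas as pd

s = pd.Series([11.56, 19.36, 29.16], index=['a', 'b', 'c'])

print(s['b'])       # access by label: 19.36
print(s.loc['b'])   # explicit label-based access: 19.36
print(s.iloc[1])    # explicit position-based access: 19.36

# Label slices include both endpoints, unlike positional slices
print(s.loc['a':'b'])
```

Using `loc` and `iloc` makes it explicit whether a key is a label or a position, which avoids ambiguity when the labels are themselves integers.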

In [15]:
#help(serie)

### DataFrame

A DataFrame object defines a dataset structured as a table in which each column is a Series object, that is, all the data in a column share the same type, while the rows are records that can contain data of different types.

A DataFrame has two indexes, one for the rows and one for the columns, and its elements can be accessed by row and column names.
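
A minimal sketch of the two indexes, assuming a small made-up table (the names `frame`, `price`, and `qty` are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame(
    {'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]},
    index=['x', 'y', 'z'],
)

print(frame.index)    # row labels
print(frame.columns)  # column labels

cell = frame.loc['y', 'price']  # access by row label and column label
row = frame.loc['y']            # a whole row comes back as a Series
same_cell = frame.iloc[1, 0]    # the same cell, addressed by position
print(cell, same_cell)          # 20.0 20.0
```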

In [16]:
columnas = ['col1', 'col2', 'col3', 'col4', 'col5']

data = np.random.random((10,5))

data

array([[0.97496633, 0.80630749, 0.01744064, 0.92846417, 0.54287339],
       [0.1953008 , 0.61264177, 0.87095326, 0.01401954, 0.4258402 ],
       [0.52683593, 0.02156816, 0.05729982, 0.2660827 , 0.57191195],
       [0.0953584 , 0.37728827, 0.98508254, 0.60804265, 0.94000814],
       [0.76566227, 0.6201061 , 0.55361627, 0.3092511 , 0.29793147],
       [0.41346485, 0.32649031, 0.86668659, 0.05981241, 0.31661119],
       [0.29718217, 0.87484695, 0.87209674, 0.08064813, 0.44907063],
       [0.49010664, 0.70812066, 0.63167799, 0.9818607 , 0.32345317],
       [0.46437337, 0.69853673, 0.58124167, 0.67233835, 0.2590392 ],
       [0.17512075, 0.67113399, 0.87652213, 0.71382187, 0.25765006]])

In [20]:
df = pd.DataFrame(data, columns=columnas)

df.head()

| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 0.974966 | 0.806307 | 0.017441 | 0.928464 | 0.542873 |
| 1 | 0.195301 | 0.612642 | 0.870953 | 0.01402 | 0.42584 |
| 2 | 0.526836 | 0.021568 | 0.0573 | 0.266083 | 0.571912 |
| 3 | 0.095358 | 0.377288 | 0.985083 | 0.608043 | 0.940008 |
| 4 | 0.765662 | 0.620106 | 0.553616 | 0.309251 | 0.297931 |


**Accessing a particular column**

In [21]:
df['col2']

0    0.806307
1    0.612642
2    0.021568
3    0.377288
4    0.620106
5    0.326490
6    0.874847
7    0.708121
8    0.698537
9    0.671134
Name: col2, dtype: float64

In [22]:
df.col2

0    0.806307
1    0.612642
2    0.021568
3    0.377288
4    0.620106
5    0.326490
6    0.874847
7    0.708121
8    0.698537
9    0.671134
Name: col2, dtype: float64

In [23]:
type(df.col2)

pandas.core.series.Series

In [24]:
type(df)

pandas.core.frame.DataFrame

**Showing the elements of a DataFrame column**

In [25]:
df['col2']

0    0.806307
1    0.612642
2    0.021568
3    0.377288
4    0.620106
5    0.326490
6    0.874847
7    0.708121
8    0.698537
9    0.671134
Name: col2, dtype: float64

In [26]:
df.col3

0    0.017441
1    0.870953
2    0.057300
3    0.985083
4    0.553616
5    0.866687
6    0.872097
7    0.631678
8    0.581242
9    0.876522
Name: col3, dtype: float64

**Renaming DataFrame columns**

To rename the columns of a DataFrame we use the pandas `rename` method, which takes a dictionary as an argument: each key is the old column name and each value is the new column name.

In [27]:
df = df.rename(columns={'col2': 'columna_2'})

In [28]:
df.head()

| | col1 | columna_2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 0.974966 | 0.806307 | 0.017441 | 0.928464 | 0.542873 |
| 1 | 0.195301 | 0.612642 | 0.870953 | 0.01402 | 0.42584 |
| 2 | 0.526836 | 0.021568 | 0.0573 | 0.266083 | 0.571912 |
| 3 | 0.095358 | 0.377288 | 0.985083 | 0.608043 | 0.940008 |
| 4 | 0.765662 | 0.620106 | 0.553616 | 0.309251 | 0.297931 |


With the `inplace` parameter of `rename`, the DataFrame itself is modified, without having to assign the result back to a variable.

In [29]:
df.rename(columns={'col3': 'columna_3'}, inplace=True)

In [30]:
df.head()

| | col1 | columna_2 | columna_3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 0.974966 | 0.806307 | 0.017441 | 0.928464 | 0.542873 |
| 1 | 0.195301 | 0.612642 | 0.870953 | 0.01402 | 0.42584 |
| 2 | 0.526836 | 0.021568 | 0.0573 | 0.266083 | 0.571912 |
| 3 | 0.095358 | 0.377288 | 0.985083 | 0.608043 | 0.940008 |
| 4 | 0.765662 | 0.620106 | 0.553616 | 0.309251 | 0.297931 |
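
The difference between the two styles of `rename` can be checked on a small, hypothetical frame (not the `df` used above): without `inplace`, the original object keeps its column names and the renamed result must be captured in a variable.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Without inplace: a new DataFrame is returned, the original is untouched
renamed = frame.rename(columns={'col1': 'first'})
print(list(frame.columns))    # ['col1', 'col2']
print(list(renamed.columns))  # ['first', 'col2']

# With inplace=True: the original DataFrame itself is modified
frame.rename(columns={'col2': 'second'}, inplace=True)
print(list(frame.columns))    # ['col1', 'second']
```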


In [31]:
new_cols = ['col1_a', 'col2_a', 'col3_a', 'col4_a', 'col5_a']

df.columns = new_cols

df.head()

| | col1_a | col2_a | col3_a | col4_a | col5_a |
|---|---|---|---|---|---|
| 0 | 0.974966 | 0.806307 | 0.017441 | 0.928464 | 0.542873 |
| 1 | 0.195301 | 0.612642 | 0.870953 | 0.01402 | 0.42584 |
| 2 | 0.526836 | 0.021568 | 0.0573 | 0.266083 | 0.571912 |
| 3 | 0.095358 | 0.377288 | 0.985083 | 0.608043 | 0.940008 |
| 4 | 0.765662 | 0.620106 | 0.553616 | 0.309251 | 0.297931 |


In [32]:
df.columns

Index(['col1_a', 'col2_a', 'col3_a', 'col4_a', 'col5_a'], dtype='object')

**Selecting several columns**

To select several columns of a DataFrame, index it with `[]` and pass a list of the columns to select. The result is a new DataFrame containing those columns, in the order given in the list.

In [33]:
df_filtrado = df[['col3_a', 'col1_a', 'col5_a']]

In [34]:
df_filtrado

| | col3_a | col1_a | col5_a |
|---|---|---|---|
| 0 | 0.017441 | 0.974966 | 0.542873 |
| 1 | 0.870953 | 0.195301 | 0.42584 |
| 2 | 0.0573 | 0.526836 | 0.571912 |
| 3 | 0.985083 | 0.095358 | 0.940008 |
| 4 | 0.553616 | 0.765662 | 0.297931 |
| 5 | 0.866687 | 0.413465 | 0.316611 |
| 6 | 0.872097 | 0.297182 | 0.449071 |
| 7 | 0.631678 | 0.490107 | 0.323453 |
| 8 | 0.581242 | 0.464373 | 0.259039 |
| 9 | 0.876522 | 0.175121 | 0.25765 |
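
A related detail is that single brackets with one name return a Series, while a list inside the brackets always returns a DataFrame, even for a single column. A short sketch on a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

as_series = frame['a']         # one name: a Series
as_frame = frame[['a']]        # a list with one name: a one-column DataFrame
reordered = frame[['c', 'a']]  # columns come back in the order of the list

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(reordered.columns))   # ['c', 'a']
```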


**Adding a new column**

In [35]:
df['nueva_columna'] = 0 # add a column filled with zeros

In [36]:
df.head()

| | col1_a | col2_a | col3_a | col4_a | col5_a | nueva_columna |
|---|---|---|---|---|---|---|
| 0 | 0.974966 | 0.806307 | 0.017441 | 0.928464 | 0.542873 | 0 |
| 1 | 0.195301 | 0.612642 | 0.870953 | 0.01402 | 0.42584 | 0 |
| 2 | 0.526836 | 0.021568 | 0.0573 | 0.266083 | 0.571912 | 0 |
| 3 | 0.095358 | 0.377288 | 0.985083 | 0.608043 | 0.940008 | 0 |
| 4 | 0.765662 | 0.620106 | 0.553616 | 0.309251 | 0.297931 | 0 |


In [37]:
[i if i%2 else None for i in range(len(df))]

[None, 1, None, 3, None, 5, None, 7, None, 9]

In [None]:
# Loop equivalent of the list comprehension above: odd positions keep i, even ones get None
lst = []
for i in range(len(df)):
    if i % 2:
        lst.append(i)
    else:
        lst.append(None)
lst

In [38]:
df['col_con_nulos'] = [i if i%2 else None for i in range(len(df))] # add a column with null values

In [39]:
df

| | col1_a | col2_a | col3_a | col4_a | col5_a | nueva_columna | col_con_nulos |
|---|---|---|---|---|---|---|---|
| 0 | 0.974966 | 0.806307 | 0.017441 | 0.928464 | 0.542873 | 0 | NaN |
| 1 | 0.195301 | 0.612642 | 0.870953 | 0.01402 | 0.42584 | 0 | 1.0 |
| 2 | 0.526836 | 0.021568 | 0.0573 | 0.266083 | 0.571912 | 0 | NaN |
| 3 | 0.095358 | 0.377288 | 0.985083 | 0.608043 | 0.940008 | 0 | 3.0 |
| 4 | 0.765662 | 0.620106 | 0.553616 | 0.309251 | 0.297931 | 0 | NaN |
| 5 | 0.413465 | 0.32649 | 0.866687 | 0.059812 | 0.316611 | 0 | 5.0 |
| 6 | 0.297182 | 0.874847 | 0.872097 | 0.080648 | 0.449071 | 0 | NaN |
| 7 | 0.490107 | 0.708121 | 0.631678 | 0.981861 | 0.323453 | 0 | 7.0 |
| 8 | 0.464373 | 0.698537 | 0.581242 | 0.672338 | 0.259039 | 0 | NaN |
| 9 | 0.175121 | 0.671134 | 0.876522 | 0.713822 | 0.25765 | 0 | 9.0 |


In [40]:
df.col_con_nulos[0]

nan

In [41]:
type(df.col_con_nulos[0])

numpy.float64
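
The output above shows that the `None` values were stored as `NaN`, a floating-point value, which is why the integer entries of the column became floats. A small sketch on a hypothetical series shows the conversion and how to detect missing values:

```python
import pandas as pd

s = pd.Series([None, 1, None, 3])

print(s.dtype)         # float64: the integers were upcast to hold NaN
print(s.isna().sum())  # 2 missing values

# NaN never compares equal to itself, so use pd.isna() instead of ==
print(s.iloc[0] == s.iloc[0])  # False
print(pd.isna(s.iloc[0]))      # True
```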

**Showing the DataFrame info**

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   col1_a         10 non-null     float64
 1   col2_a         10 non-null     float64
 2   col3_a         10 non-null     float64
 3   col4_a         10 non-null     float64
 4   col5_a         10 non-null     float64
 5   nueva_columna  10 non-null     int64  
 6   col_con_nulos  5 non-null      float64
dtypes: float64(6), int64(1)
memory usage: 688.0 bytes


**Creating an empty DataFrame**

In [44]:
empty_df = pd.DataFrame()

In [45]:
empty_df

**Creating a DataFrame from a list of lists**

In [46]:
lst_lst = [[655643, 'buenas', 35432],
          [354, 'como andas', 899],
          [3543, 'ooi']]

columnas = ['num', 'str', 'otro_num']

df_lst = pd.DataFrame(lst_lst, columns=columnas)

df_lst

| | num | str | otro_num |
|---|---|---|---|
| 0 | 655643 | buenas | 35432.0 |
| 1 | 354 | como andas | 899.0 |
| 2 | 3543 | ooi | NaN |


**Creating a DataFrame from a dictionary**

In [47]:
dictio = {'casa':lst_lst[0],
         'oficina': lst_lst[1],
         'numero': lst_lst[2]+[0]}

df_dictio = pd.DataFrame(dictio)

df_dictio

| | casa | oficina | numero |
|---|---|---|---|
| 0 | 655643 | 354 | 3543 |
| 1 | buenas | como andas | ooi |
| 2 | 35432 | 899 | 0 |
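
The two constructors treat the short third list differently, which is why `[0]` was appended to it in the dictionary version. The sketch below reuses the same values under hypothetical variable names. A list of lists is read row by row and a short row is padded with a missing value, whereas a dictionary is read column by column and all columns must have the same length:

```python
import pandas as pd

rows = [[655643, 'buenas', 35432],
        [354, 'como andas', 899],
        [3543, 'ooi']]

# List of rows: the short row is padded with a missing value
from_rows = pd.DataFrame(rows, columns=['num', 'str', 'otro_num'])
print(from_rows.isna().sum().sum())  # 1

# Dictionary of columns: unequal lengths are rejected
raised = False
try:
    pd.DataFrame({'casa': rows[0], 'numero': rows[2]})
except ValueError as err:
    raised = True
    print('ValueError:', err)
```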


In [52]:
df_dictio.set_index('casa')

| casa | oficina | numero |
|---|---|---|
| 655643 | 354 | 3543 |
| buenas | como andas | ooi |
| 35432 | 899 | 0 |
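
Note that `set_index` returns a new DataFrame and, unless the result is assigned, the original keeps its default index. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'key': ['k1', 'k2'], 'value': [10, 20]})

indexed = frame.set_index('key')  # 'key' moves from the columns to the row index
print(list(frame.columns))        # ['key', 'value']: the original is unchanged
print(list(indexed.columns))      # ['value']
print(indexed.loc['k2', 'value']) # 20

restored = indexed.reset_index()  # moves the index back into a regular column
print(list(restored.columns))     # ['key', 'value']
```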


### DataFrame operations

**Transpose**

In [53]:
df.transpose()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| col1_a | 0.974966 | 0.195301 | 0.526836 | 0.095358 | 0.765662 | 0.413465 | 0.297182 | 0.490107 | 0.464373 | 0.175121 |
| col2_a | 0.806307 | 0.612642 | 0.021568 | 0.377288 | 0.620106 | 0.32649 | 0.874847 | 0.708121 | 0.698537 | 0.671134 |
| col3_a | 0.017441 | 0.870953 | 0.0573 | 0.985083 | 0.553616 | 0.866687 | 0.872097 | 0.631678 | 0.581242 | 0.876522 |
| col4_a | 0.928464 | 0.01402 | 0.266083 | 0.608043 | 0.309251 | 0.059812 | 0.080648 | 0.981861 | 0.672338 | 0.713822 |
| col5_a | 0.542873 | 0.42584 | 0.571912 | 0.940008 | 0.297931 | 0.316611 | 0.449071 | 0.323453 | 0.259039 | 0.25765 |
| nueva_columna | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| col_con_nulos | NaN | 1.0 | NaN | 3.0 | NaN | 5.0 | NaN | 7.0 | NaN | 9.0 |


In [54]:
df.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| col1_a | 0.974966 | 0.195301 | 0.526836 | 0.095358 | 0.765662 | 0.413465 | 0.297182 | 0.490107 | 0.464373 | 0.175121 |
| col2_a | 0.806307 | 0.612642 | 0.021568 | 0.377288 | 0.620106 | 0.32649 | 0.874847 | 0.708121 | 0.698537 | 0.671134 |
| col3_a | 0.017441 | 0.870953 | 0.0573 | 0.985083 | 0.553616 | 0.866687 | 0.872097 | 0.631678 | 0.581242 | 0.876522 |
| col4_a | 0.928464 | 0.01402 | 0.266083 | 0.608043 | 0.309251 | 0.059812 | 0.080648 | 0.981861 | 0.672338 | 0.713822 |
| col5_a | 0.542873 | 0.42584 | 0.571912 | 0.940008 | 0.297931 | 0.316611 | 0.449071 | 0.323453 | 0.259039 | 0.25765 |
| nueva_columna | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| col_con_nulos | NaN | 1.0 | NaN | 3.0 | NaN | 5.0 | NaN | 7.0 | NaN | 9.0 |


**Sum**

In [55]:
df.sum()

col1_a            4.398372
col2_a            5.717040
col3_a            6.312618
col4_a            4.634342
col5_a            4.384389
nueva_columna     0.000000
col_con_nulos    25.000000
dtype: float64
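
In the output above, `col_con_nulos` sums to 25 because missing values are skipped by default. The sketch below, on a hypothetical two-column frame, shows the `skipna` parameter that controls this and the `axis` parameter for summing across rows instead of down columns:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'full': [1.0, 2.0, 3.0],
                      'with_nan': [1.0, np.nan, 3.0]})

default_sum = frame.sum()             # NaN is skipped: with_nan -> 4.0
strict_sum = frame.sum(skipna=False)  # NaN propagates: with_nan -> NaN
row_sum = frame.sum(axis=1)           # one total per row instead of per column

print(default_sum)
print(strict_sum)
print(row_sum)
```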

**Maximum, minimum, standard deviation, mean, median, mode**

In [56]:
df.max()

col1_a           0.974966
col2_a           0.874847
col3_a           0.985083
col4_a           0.981861
col5_a           0.940008
nueva_columna    0.000000
col_con_nulos    9.000000
dtype: float64

In [57]:
df.min()

col1_a           0.095358
col2_a           0.021568
col3_a           0.017441
col4_a           0.014020
col5_a           0.257650
nueva_columna    0.000000
col_con_nulos    1.000000
dtype: float64

In [58]:
df.std()

col1_a           0.273416
col2_a           0.257361
col3_a           0.344957
col4_a           0.363072
col5_a           0.209029
nueva_columna    0.000000
col_con_nulos    3.162278
dtype: float64

In [59]:
df.mean()

col1_a           0.439837
col2_a           0.571704
col3_a           0.631262
col4_a           0.463434
col5_a           0.438439
nueva_columna    0.000000
col_con_nulos    5.000000
dtype: float64

In [60]:
df.median()

col1_a           0.438919
col2_a           0.645620
col3_a           0.749182
col4_a           0.458647
col5_a           0.374647
nueva_columna    0.000000
col_con_nulos    5.000000
dtype: float64

In [62]:
df['col1_a'].median()

0.4389191116555385

In [66]:
pd.DataFrame(np.ones((5,2))).mean()

0    1.0
1    1.0
dtype: float64
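
The mode, listed in the heading of this section but not computed in the cells, is available through `mode()`. As a sketch on a small hypothetical frame, the example below also uses `describe()`, which gathers several of the statistics shown here in a single table:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [4, 4, 5, 6]})

# mode() returns a DataFrame because a column can have more than one mode
modes = frame.mode()
print(modes)

# describe() reports count, mean, std, min, quartiles (50% is the median) and max
summary = frame.describe()
print(summary.loc[['mean', 'std', 'min', '50%', 'max']])
```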