<a href="https://colab.research.google.com/github/pmendizabal/crash_course_on_analytics_with_python/blob/main/Pandas_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#What is pandas?

Pandas is a Python library specialized in handling data for many kinds of analysis. Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management, out of the need for a flexible, high-performance tool for quantitative analysis of financial data. Before leaving AQR, he convinced management to let him release the library as open source.

Source: https://en.wikipedia.org/wiki/Pandas_(software)

###Dataframe

The DataFrame is the core data structure in pandas. It is comparable to a spreadsheet in Excel or a table in SQL.

In [1]:
import pandas as pd

#Create the DataFrame object manually

df = pd.DataFrame({"economist": ["Marx",
                                 "Ricardo",
                                 "Smith"],
                   "death_year": [1883, 1823, 1790],
                   "book": ["The Capital","On the Principles of Political Economy and Taxation",
                            "The Wealth of Nations"]})

df

,economist,death_year,book
0,Marx,1883,The Capital
1,Ricardo,1823,On the Principles of Political Economy and Tax...
2,Smith,1790,The Wealth of Nations


Note that a dictionary of lists is easy to turn into a DataFrame. In that case, the dictionary keys are used as the column names.

In [2]:
econ_dict = {"economist": ["Marx",
                                 "Ricardo",
                                 "Smith"],
                   "death_year": [1883, 1823, 1790],
                   "book": ["The Capital","On the Principles of Political Economy and Taxation",
                            "The Wealth of Nations"]}
econ_dict

{'book': ['The Capital',
  'On the Principles of Political Economy and Taxation',
  'The Wealth of Nations'],
 'death_year': [1883, 1823, 1790],
 'economist': ['Marx', 'Ricardo', 'Smith']}

In [3]:
econ_df = pd.DataFrame(econ_dict)
econ_df

,economist,death_year,book
0,Marx,1883,The Capital
1,Ricardo,1823,On the Principles of Political Economy and Tax...
2,Smith,1790,The Wealth of Nations


In [29]:
#Create a DataFrame from Series
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df_2 = pd.DataFrame(d)
df_2

,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0
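
The empty cell in row `d` comes from index alignment: the Series are combined on the union of their labels, and any label missing from one of them becomes `NaN`. A minimal sketch of how this could be checked, reusing the same data (the names `one`, `two` and `aligned` are only illustrative):

```python
import pandas as pd

# Two Series with partially overlapping labels (same data as the cell above)
one = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
two = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])

aligned = pd.DataFrame({'one': one, 'two': two})

# The resulting row index is the union of both indexes
index_labels = list(aligned.index)

# Count the missing values per column: 'one' has no value for label 'd'
missing_per_column = aligned.isna().sum()

print(index_labels)
print(missing_per_column)
```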


In [31]:
#Access the index and the columns
#(a notebook cell only displays its last expression, here df_2.columns)

df_2.index
df_2.columns

Index(['one', 'two'], dtype='object')
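
To see both attributes from a single cell, one option is to print them explicitly. This is a small sketch on a hypothetical DataFrame called `df_demo`, not the notebook's `df_2`:

```python
import pandas as pd

# Hypothetical DataFrame with explicit row labels
df_demo = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.]}, index=['a', 'b'])

# .index holds the row labels, .columns holds the column labels
row_labels = df_demo.index.tolist()
column_labels = df_demo.columns.tolist()

print(row_labels)
print(column_labels)
```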

In [32]:
#Create a DataFrame from a list of dictionaries

raw_data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(raw_data)

,a,b,c
0,1,2,
1,5,10,20.0


In [33]:
#Column names that are not keys of the dictionaries produce all-NaN columns
pd.DataFrame(raw_data, columns=['one','two','three'])

,one,two,three
0,,,
1,,,
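
With a list of dictionaries, the `columns` argument selects and orders existing keys; it does not rename them. The sketch below contrasts selecting keys with renaming them afterwards through `rename` (the names `subset` and `renamed` are illustrative):

```python
import pandas as pd

raw_data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# `columns` picks keys that exist in the dictionaries and sets their order
subset = pd.DataFrame(raw_data, columns=['c', 'a'])

# To get different names, build the DataFrame first and rename afterwards
renamed = pd.DataFrame(raw_data).rename(
    columns={'a': 'one', 'b': 'two', 'c': 'three'}
)

print(subset)
print(renamed)
```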


In [34]:
#Column selection and other operations

econ_df['book']

0                                          The Capital
1    On the Principles of Political Economy and Tax...
2                                The Wealth of Nations
Name: book, dtype: object

In [41]:
#Create new columns
econ_df['oldest'] = econ_df['death_year'] < 1800 
econ_df

,economist,death_year,book,oldest
0,Marx,1883,The Capital,False
1,Ricardo,1823,On the Principles of Political Economy and Tax...,False
2,Smith,1790,The Wealth of Nations,True


In [42]:
#Delete columns

del econ_df['oldest']
econ_df

,economist,death_year,book
0,Marx,1883,The Capital
1,Ricardo,1823,On the Principles of Political Economy and Tax...
2,Smith,1790,The Wealth of Nations
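
`del` modifies the DataFrame in place. When the original should stay untouched, `DataFrame.drop` can be used instead, since it returns a new object. A minimal sketch on a reduced copy of the data (the names `econ` and `trimmed` are illustrative):

```python
import pandas as pd

econ = pd.DataFrame({"economist": ["Marx", "Ricardo", "Smith"],
                     "death_year": [1883, 1823, 1790]})
econ["oldest"] = econ["death_year"] < 1800

# drop returns a new DataFrame; `econ` keeps its 'oldest' column
trimmed = econ.drop(columns=["oldest"])

print(list(econ.columns))
print(list(trimmed.columns))
```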


In [43]:
#Assigning a single value to a column broadcasts it to every row

econ_df['repeated'] = 1
econ_df

,economist,death_year,book,repeated
0,Marx,1883,The Capital,1
1,Ricardo,1823,On the Principles of Political Economy and Tax...,1
2,Smith,1790,The Wealth of Nations,1


###Other ways of selecting data

In [52]:
#Select rows with loc by index label (here the labels happen to match the positions)
econ_df.loc[1]

economist                                               Ricardo
death_year                                                 1823
book          On the Principles of Political Economy and Tax...
repeated                                                      1
Name: 1, dtype: object

In [53]:
#Double brackets return a DataFrame

econ_df.loc[[1,2]]

,economist,death_year,book,repeated
1,Ricardo,1823,On the Principles of Political Economy and Tax...,1
2,Smith,1790,The Wealth of Nations,1


In [54]:
#Slicing by label with loc includes both endpoints
econ_df.loc[0:1, 'economist']

0       Marx
1    Ricardo
Name: economist, dtype: object

In [55]:
#Boolean conditions with loc return a DataFrame

econ_df.loc[econ_df['death_year'] > 1800]

,economist,death_year,book,repeated
0,Marx,1883,The Capital,1
1,Ricardo,1823,On the Principles of Political Economy and Tax...,1


In [56]:
#Assign values

econ_df.loc[0,'repeated'] = 0
econ_df

,economist,death_year,book,repeated
0,Marx,1883,The Capital,0
1,Ricardo,1823,On the Principles of Political Economy and Tax...,1
2,Smith,1790,The Wealth of Nations,1
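
Assignment through `loc` also accepts a boolean condition in place of a single label, which updates every matching row at once. A sketch under the assumption of a reduced copy of the data (the name `econ` is illustrative):

```python
import pandas as pd

econ = pd.DataFrame({"economist": ["Marx", "Ricardo", "Smith"],
                     "death_year": [1883, 1823, 1790],
                     "repeated": [1, 1, 1]})

# Set 'repeated' to 0 for every economist who died after 1800
econ.loc[econ["death_year"] > 1800, "repeated"] = 0

print(econ)
```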


In [57]:
#Use iloc to select a row by integer position, regardless of the index labels

econ_df.iloc[0]

economist            Marx
death_year           1883
book          The Capital
repeated                0
Name: 0, dtype: object

In [58]:
#Double brackets return a DataFrame
econ_df.iloc[[0]]

,economist,death_year,book,repeated
0,Marx,1883,The Capital,0


In [63]:
#Row and column positions can be combined
econ_df.iloc[0,0]

'Marx'

In [66]:
econ_df.iloc[:2, 1]

0    1883
1    1823
Name: death_year, dtype: int64
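
In the cells above, labels and positions coincide because the index is the default 0, 1, 2. The difference between `loc` and `iloc` becomes visible with a different index. The sketch below uses a hypothetical index of 10, 20, 30 on a reduced copy of the data:

```python
import pandas as pd

# Hypothetical index whose labels do not match the positions
econ = pd.DataFrame({"economist": ["Marx", "Ricardo", "Smith"],
                     "death_year": [1883, 1823, 1790]},
                    index=[10, 20, 30])

by_label = econ.loc[20, "economist"]    # row whose label is 20
by_position = econ.iloc[0, 0]           # first row, first column

# Label slices include both endpoints; position slices exclude the stop
label_slice = econ.loc[10:20, "economist"]
position_slice = econ.iloc[0:1, 0]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```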

###Related data structures

Each column of a DataFrame is a Series.

In [4]:
#A column is accessed with square brackets, much like a value in a dictionary

econ_df["economist"]

0       Marx
1    Ricardo
2      Smith
Name: economist, dtype: object

In [5]:
#A Series can also be created directly with pandas. Note that a Series has no
#column names, only an optional name

other_econ = pd.Series(["Friedman","Yellen"], name="economist")
other_econ

0    Friedman
1      Yellen
Name: economist, dtype: object
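
The `name` of a Series matters when it is turned into a DataFrame, because it becomes the column label. A short sketch using `Series.to_frame` (the names `as_frame` and `unnamed` are illustrative):

```python
import pandas as pd

other_econ = pd.Series(["Friedman", "Yellen"], name="economist")

# The Series name is used as the column label of the new DataFrame
as_frame = other_econ.to_frame()

# Without the `name` argument, the Series has no name at all
unnamed = pd.Series(["Friedman", "Yellen"])

print(list(as_frame.columns))
print(unnamed.name)
```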

In [9]:
#From a dictionary to a pandas Series

dict_ = {"a": 1, "b": 2, "c": 3, "d": 4}
pd.Series(dict_)

a    1
b    2
c    3
d    4
dtype: int64

In [10]:
#NumPy is a companion library for numerical arrays and mathematical functions

import numpy as np 

dict_2 = pd.Series(np.random.randn(8), index=['a','b','c','d','e','f','g','h'])
dict_2


a   -0.364475
b    0.304384
c    0.005643
d    0.045326
e   -0.910787
f   -1.372465
g   -0.055360
h   -0.233163
dtype: float64

In [12]:
dict_3 = pd.Series(np.random.randn(8), index=['h','b','c','d','e','f','g','a'])
dict_3

h    0.218436
b   -0.452562
c    0.265557
d   -0.410621
e    0.467148
f    0.334874
g   -0.790630
a   -0.540346
dtype: float64

In [15]:
#Missing values: index labels absent from the dictionary are filled with NaN

dict_ = {"a": 1, "b": 2, "c": 3, "d": 4}
pd.Series(dict_, index=['a','b','d','c','e'])

a    1.0
b    2.0
d    4.0
c    3.0
e    NaN
dtype: float64
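
Once a Series contains `NaN`, the usual next step is to detect, replace, or discard those entries. A minimal sketch with `isna`, `fillna` and `dropna` on the same data (the variable names are illustrative, and replacing with 0 is only one possible choice):

```python
import pandas as pd

values = {"a": 1, "b": 2, "c": 3, "d": 4}
s = pd.Series(values, index=['a', 'b', 'd', 'c', 'e'])

missing_mask = s.isna()   # True where the value is NaN
filled = s.fillna(0)      # replace NaN with 0
dropped = s.dropna()      # discard the NaN entries

print(missing_mask)
print(filled)
print(dropped)
```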

In [19]:
#A Series behaves like an array; use iloc for positional access
dict_3.iloc[0]

0.21843562781213724

In [23]:
dict_3[dict_3 > dict_3.median()]

h    0.218436
c    0.265557
e    0.467148
f    0.334874
dtype: float64

In [25]:
#Vectorized operations

dict_3 + dict_3

h    0.436871
b   -0.905124
c    0.531114
d   -0.821242
e    0.934296
f    0.669748
g   -1.581259
a   -1.080692
dtype: float64

In [24]:
np.exp(dict_3)

h    1.244129
b    0.635997
c    1.304157
d    0.663238
e    1.595437
f    1.397764
g    0.453559
a    0.582547
dtype: float64

In [28]:
dict_3 * 100

h    21.843563
b   -45.256179
c    26.555707
d   -41.062096
e    46.714794
f    33.487402
g   -79.062956
a   -54.034586
dtype: float64
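
Arithmetic between two different Series matches values by label, not by position, and labels present in only one operand give `NaN`. A sketch with two small hypothetical Series, `left` and `right`, also showing `Series.add` with `fill_value` as one way to avoid the `NaN`:

```python
import pandas as pd

# Hypothetical Series with different label order and partial overlap
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['c', 'a', 'd'])

# Values are paired by label: a -> 1 + 20, c -> 3 + 10; b and d are NaN
total = left + right

# fill_value treats a label missing from one side as 0 before adding
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```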

###Operations on these data structures

In [6]:
econ_df["death_year"].max()

1883

In [7]:
#By default, this method skips columns that do not hold numeric values
econ_df.describe()

,death_year
count,3.0
mean,1832.0
std,47.148701
min,1790.0
25%,1806.5
50%,1823.0
75%,1853.0
max,1883.0
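
To summarize the text columns as well, `describe` accepts an `include` argument. The sketch below compares the default call with `include="all"` on a reduced copy of the data (the names `numeric_summary` and `full_summary` are illustrative):

```python
import pandas as pd

econ = pd.DataFrame({"economist": ["Marx", "Ricardo", "Smith"],
                     "death_year": [1883, 1823, 1790]})

# Default: only numeric columns are summarized
numeric_summary = econ.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = econ.describe(include="all")

print(numeric_summary)
print(full_summary)
```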


In [67]:
sub_econ_df = econ_df[(econ_df["death_year"] > 1800)]
sub_econ_df

,economist,death_year,book,repeated
0,Marx,1883,The Capital,0
1,Ricardo,1823,On the Principles of Political Economy and Tax...,1


In [68]:
sub_econ_df.shape

(2, 4)

In [70]:
sub_econ_df = econ_df.loc[econ_df["death_year"] > 1800, "economist"]
sub_econ_df

0       Marx
1    Ricardo
Name: economist, dtype: object
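
Several conditions can be combined in the same filter. Each condition needs its own parentheses, joined with `&` (and) or `|` (or), and `isin` checks membership in a list of values. A sketch on a reduced copy of the data (the names `early_19th` and `selected` are illustrative):

```python
import pandas as pd

econ = pd.DataFrame({"economist": ["Marx", "Ricardo", "Smith"],
                     "death_year": [1883, 1823, 1790]})

# Both conditions must hold: died after 1800 and before 1850
early_19th = econ[(econ["death_year"] > 1800) & (econ["death_year"] < 1850)]

# Keep only the rows whose economist appears in the given list
selected = econ[econ["economist"].isin(["Marx", "Smith"])]

print(early_19th)
print(selected)
```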