https://github.com/wesm/pydata-book

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
import numpy as np
np.random.seed(12345)  # fix a random seed for reproducibility
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

# 5.1 Introduction to pandas Data Structures

# 5.1.1 Series

In [3]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
print(obj.values)
print(obj.index) # like range(4)

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


In [5]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
obj2.index

d    4
b    7
a   -5
c    3
dtype: int64


Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single
values or a set of values. In the cell that follows, ['c', 'a', 'd'] is interpreted as a
list of indices, even though it contains strings instead of integers:

In [7]:
# COMMON SENSE
print(obj2['a'])
obj2['d'] = 6
obj2[['c', 'a', 'd']]  # a selection is applied, but the result is not stored

-5


c    3
a   -5
d    6
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link:

In [8]:
obj2 > 0

d     True
b     True
a    False
c     True
dtype: bool

In [9]:
# COMMON SENSE
print(obj2[obj2 > 0])  # a filter is applied, but the result is not stored
print("------------")
print(obj2 * 2)
print("------------")
print(np.exp(obj2))

d    6
b    7
c    3
dtype: int64
------------
d    12
b    14
a   -10
c     6
dtype: int64
------------
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64


Another way to think about a Series is as a fixed-length, ordered dict,
as *it is a mapping of index values to data values.* 
It can be used in many contexts where you might use a dict:

In [45]:
print(obj2)
print(obj2.values)
print('b' in obj2)
print('e' in obj2)
print(7 in obj2)
print(7 in obj2.values)

d    4
b    7
a   -5
c    3
dtype: int64
[ 4  7 -5  3]
True
False
False
True
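
The dict analogy can be pushed a little further. The following is a minimal sketch, assuming the same example values as above; the variable names (`s`, `missing`, `as_dict`) are illustrative only. It shows that `in` checks the index labels rather than the values, and that `Series.get` and `Series.to_dict` behave much like their dict counterparts.

```python
import pandas as pd

# Same example data as above (names here are illustrative).
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# `in` tests membership among the index labels (the "keys"), not the values.
has_b = 'b' in s
has_7 = 7 in s

# dict-style lookup with a fallback for a label that does not exist.
missing = s.get('e', 0)

# Convert to a plain Python dict of label -> value.
as_dict = s.to_dict()

print(has_b, has_7, missing)
print(as_dict)
```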


###### NOTE
###### WAYS TO CREATE A SERIES
Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:

In [11]:
# WAYS TO CREATE A SERIES
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
# or, laid out more readably
sdata = {'Ohio': 35000, 
         'Texas': 71000, 
         'Oregon': 16000, 
         'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [12]:
# PASS THE DESIRED INDEX ORDER / filtering
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
# 'Utah' is not in states, so it is excluded; 'California' has no value in sdata, so it becomes NaN

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

The isnull and notnull functions in pandas are used to detect missing data:

In [13]:
print(pd.isnull(obj4))  # True where a value is missing
print("------------")
print(pd.notnull(obj4))  # negation of the above: True where a value is present

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
------------
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool


In [14]:
#Series also has these as instance methods:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
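
Because the result of `isnull` is a boolean Series, it can be summed or used as a mask. The sketch below rebuilds the same Series under the illustrative name `s` and is only one possible way to count and filter missing entries.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
s = pd.Series(sdata, index=states)

# True counts as 1 when summed, so this is the number of missing values.
n_missing = int(s.isnull().sum())

# A boolean mask keeps only the rows where a value is present.
present = s[s.notnull()]

print(n_missing)
print(present)
```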

A useful Series feature for many applications is that
*it automatically aligns by index label in arithmetic operations*:

In [15]:
# COMMON SENSE
print(obj3)
print("------------")
print(obj4)
print("------------")
print(obj3 + obj4)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
------------
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
------------
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
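
Alignment produces NaN wherever a label exists on only one side. If that is not what you want, one option (not used in the original notes) is the `Series.add` method with its `fill_value` parameter. This is a small sketch with hypothetical Series `a` and `b`:

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series({'Ohio': 35000.0, 'Oregon': 16000.0, 'Texas': 71000.0})

# The + operator aligns on labels; 'Utah' is absent from b, so the sum is NaN.
plain = a + b

# add() with fill_value=0 treats a label missing on one side as 0.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```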


#### Both the Series object itself and its index have a *name attribute*

In [16]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series’s index can be altered in-place by assignment:

In [17]:
# CHANGE THE INDEX --> THE CHANGE IS STORED
print(obj)
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

0    4
1    7
2   -5
3    3
dtype: int64


Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

# 5.1.2 DataFrame

A DataFrame:
- represents a rectangular table of data and contains an ordered collection of columns
- each of which can be a different value type (numeric, string, boolean, etc.)
- has both a row and column index

It can be thought of as a dict of Series all sharing the same index.
Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays.

##### NOTE
While a DataFrame is physically two-dimensional, you can use it to
represent higher dimensional data in a tabular format using hierarchical
indexing, a subject we will discuss in Chapter 8 and an
ingredient in some of the more advanced data-handling features in
pandas.

In [18]:
# CREATE A DATAFRAME
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [19]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [20]:
# COMMON SENSE
# ORDER THE COLUMNS / filtering
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that is not contained in the dict, it will appear with missing values (NaN) in the result:

In [21]:
# COMMON SENSE
# CHANGE THE INDEX
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],  # 'debt' is not a key in data
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
print(frame2)
frame2.columns

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


Index(['year', 'state', 'pop', 'debt'], dtype='object')

#### NOTE
A column in a DataFrame can be retrieved as a Series either by dict-like notation or
by attribute.
NOTE: frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable
name and does not clash with an existing DataFrame attribute or method.

In [22]:
# GET A SERIES/COLUMN FROM A DATAFRAME
print(frame2)
print("------------")
print(frame2['state'])  # dict-like notation
print("------------")
print(frame2.year)  # attribute access

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
------------
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
------------
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
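
The example data happens to contain a column where attribute access goes wrong: `pop` is also the name of a DataFrame method. The sketch below uses a smaller, hypothetical `frame` to show that dict-like notation returns the column while attribute access returns the method.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

by_key = frame['pop']   # the column, returned as a Series
by_attr = frame.pop     # the DataFrame.pop method, not the column

print(type(by_key).__name__)
print(callable(by_attr))
```

For this reason, dict-like notation is the safer default whenever column names come from external data.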


In [23]:
frame3 = frame2['state']
frame3

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Note that the returned Series have the same index as the DataFrame, and their name
attribute has been appropriately set.

In [24]:
frame3.index = ['1','2','3','4','5','6']
frame3

1      Ohio
2      Ohio
3      Ohio
4    Nevada
5    Nevada
6    Nevada
Name: state, dtype: object

In [25]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [26]:
# GET A ROW FROM A DATAFRAME
row_frame2 = frame2.loc['three']
row_frame2

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

#### NOTE

In [27]:
# MODIFY THE VALUES OF A COLUMN
#Columns can be modified by assignment.
frame2['debt'] = 15.5
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  15.5
two    2001    Ohio  1.7  15.5
three  2002    Ohio  3.6  15.5
four   2001  Nevada  2.4  15.5
five   2002  Nevada  2.9  15.5
six    2003  Nevada  3.2  15.5


In [28]:
frame2['debt'] = np.arange(6)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5


In [29]:
# COMMON SENSE
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
# or
sdata = {'two': -1.2, 'four': -1.5, 'five': -1.7}  # the Series index labels must match the DataFrame's index labels
val = pd.Series(sdata)

frame2['debt'] = val
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
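
The three assignments above behave differently: a scalar is broadcast, an array is assigned by position, and a Series is aligned by label. The following sketch, on a small hypothetical `frame`, contrasts label alignment with what happens when a plain list of the wrong length is assigned; the column names `debt` and `bad` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A Series is aligned on labels: 'two' matches, 'ten' has no matching row
# and is ignored, and rows without a matching label get NaN.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'ten'])

# A plain list carries no labels, so its length must equal the number of rows.
try:
    frame['bad'] = [1, 2]
    raised = False
except ValueError:
    raised = True

print(frame)
print(raised)
```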


Assigning a column that doesn’t exist will create a new column.
The del keyword will delete columns, as with a dict:

In [30]:
# CREATE A NEW COLUMN AND DELETE A COLUMN
print(frame2.state == 'Ohio') #boolean array
frame2['eastern'] = frame2.state == 'Ohio'
frame2

one       True
two       True
three     True
four     False
five     False
six      False
Name: state, dtype: bool


       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False


In [31]:
del frame2['eastern']
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


New columns cannot be created with the frame2.eastern syntax.
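
A short sketch of what that means in practice, using a hypothetical two-row `frame`: attribute assignment sets an ordinary Python attribute on the object, while bracket assignment adds a column. pandas may emit a warning for the attribute form, which is silenced here only to keep the output clean.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas may warn about this pattern
    # Sets a plain attribute on the object; no column is created.
    frame.eastern = frame['state'] == 'Ohio'

# Bracket assignment creates a real column ('is_ohio' is an illustrative name).
frame['is_ohio'] = frame['state'] == 'Ohio'

print(list(frame.columns))
```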

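##### NOTE
The column returned from indexing a DataFrame is a view on the
underlying data, not a copy (this describes the pandas version used here; with Copy-on-Write enabled in newer releases, the returned column behaves like a copy).
Thus, any in-place modifications to the Series will be reflected in the DataFrame.
The column can be explicitly copied with the Series’s copy method.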
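
A minimal sketch of the explicit copy, with a hypothetical `frame` and the illustrative name `pop_copy`. Modifying the copy leaves the DataFrame untouched regardless of how views are handled:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

pop_copy = frame['pop'].copy()  # independent of the DataFrame
pop_copy.iloc[0] = 99.0         # modifies only the copy

print(frame['pop'].tolist())
print(pop_copy.tolist())
```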

Another common form of data is a nested dict of dicts:

In [32]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [33]:
# CREATE A DATAFRAME
#data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        #'year': [2000, 2001, 2002, 2001, 2002, 2003],
        #'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
#frame = pd.DataFrame(data)
#frame

If the nested dict is passed to the DataFrame, pandas will interpret 
- the outer dict keys as the columns and 
- the inner keys as the row indices:

In [34]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


##### Transposing a DataFrame
You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [35]:
# TRANSPOSE
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


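The keys in the inner dicts are combined to form the index in the result; in the output above they appear in order of first occurrence (2001, 2002, 2000) rather than sorted.
If an explicit index is specified, that index is used instead: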

In [36]:
# COMMON SENSE
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
print(frame3)
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [37]:
# COMMON SENSE
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9


In [38]:
# NAME THE INDEX AND THE COLUMNS
print(frame3)
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


As with Series, the
*values*
attribute returns the data contained in the DataFrame as a two-dimensional array:

In [39]:
# COMMON SENSE
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [40]:
frame3.values.dtype

dtype('float64')

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:

In [41]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


In [42]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
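
To see how the common dtype is chosen, it can help to compare the per-column dtypes with the dtype of the combined array. The sketch below uses a small hypothetical `frame`; the names `mixed` and `numeric` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

print(frame.dtypes)  # one dtype per column

# Strings and numbers together can only share the generic object dtype.
mixed = frame.values.dtype

# Integer and float columns alone are upcast to a common float dtype.
numeric = frame[['year', 'pop']].values.dtype

print(mixed, numeric)
```

Selecting only the numeric columns before taking `values` avoids an object array when the result is meant for numerical work.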

# 5.1.3 Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names).

Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index:

In [43]:
data = range(3) #0,1,2
obj = pd.Series(data, index=['a', 'b', 'c'])
print(obj)
index = obj.index
print(index)
index[1:]

a    0
b    1
c    2
dtype: int64
Index(['a', 'b', 'c'], dtype='object')


Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [44]:
index[1] = 'd' # TypeError

TypeError: Index does not support mutable operations
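
Since an Index cannot be changed in place, the usual workaround is to build a new one. This is a small sketch with illustrative names (`labels`, `new_index`) that catches the error and then creates a modified copy:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment on an Index raises TypeError.
try:
    index[1] = 'd'
    error = None
except TypeError as exc:
    error = exc

# Build a new Index from a modified list instead of mutating the old one.
labels = index.tolist()
labels[1] = 'd'
new_index = pd.Index(labels)

print(type(error).__name__)
print(list(index), list(new_index))
```

Immutability is what makes it safe to share one Index object among several data structures.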

In [46]:
# AUTOMATE: build the Index once and reuse it
labels = pd.Index(np.arange(3))
print(labels)
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

Int64Index([0, 1, 2], dtype='int64')


0    1.5
1   -2.5
2    0.0
dtype: float64

In [47]:
obj2.index is labels

True

In [48]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [49]:
print(frame3.columns)
print('Ohio' in frame3.columns)
print(2003 in frame3.index)

Index(['Nevada', 'Ohio'], dtype='object', name='state')
True
False
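
Membership tests like the ones above make an Index behave somewhat like a set. As a hedged extension that is not part of the original notes, pandas Index objects also provide set-style methods such as `intersection` and `difference`; the Index `wanted` below is hypothetical.

```python
import pandas as pd

cols = pd.Index(['Nevada', 'Ohio'], name='state')
wanted = pd.Index(['Ohio', 'Texas'])  # hypothetical labels to look for

has_ohio = 'Ohio' in cols

# Labels present in both indexes.
common = cols.intersection(wanted)

# Labels requested but not available in cols.
missing = wanted.difference(cols)

print(has_ohio)
print(list(common), list(missing))
```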
