<img src="mioti.png" style="height: 100px">
<center style="color:#888">Data Science with Python</center>

# DSPy4. Pandas "advanced"

In [2]:
import numpy as np  # used below for np.arange, np.mean and np.min
import pandas as pd

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-alignment-and-arithmetic-operations" data-toc-modified-id="Data-alignment-and-arithmetic-operations-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data alignment and arithmetic operations</a></span></li><li><span><a href="#Reindexing" data-toc-modified-id="Reindexing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reindexing</a></span></li><li><span><a href="#Apply" data-toc-modified-id="Apply-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Apply</a></span></li><li><span><a href="#Group-by" data-toc-modified-id="Group-by-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Group by</a></span></li><li><span><a href="#Merging-DataFrames" data-toc-modified-id="Merging-DataFrames-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Merging DataFrames</a></span></li><li><span><a href="#The-&quot;Category&quot;-data-type" data-toc-modified-id="The-&quot;Category&quot;-data-type-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>The "Category" data type</a></span></li><li><span><a href="#Pivot-Tables" data-toc-modified-id="Pivot-Tables-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Pivot Tables</a></span></li><li><span><a href="#Method-chaining-vs.-chain-indexing" data-toc-modified-id="Method-chaining-vs.-chain-indexing-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Method chaining vs. chain indexing</a></span></li></ul></div>

### Data alignment and arithmetic operations

When we perform arithmetic operations on pandas data structures, the result is aligned on the index labels (both rows and columns), and `NaN` is inserted wherever a label is missing from one of the operands:

In [3]:
df1 = pd.DataFrame(np.arange(9).reshape((3,3)),
    columns=['Wind', 'Temp', 'Water_Q'],
    index=['Omaha', 'Copacabana', 'Bondi'])
df1

Unnamed: 0,Wind,Temp,Water_Q
Omaha,0,1,2
Copacabana,3,4,5
Bondi,6,7,8


In [4]:
df2 = pd.DataFrame(np.arange(15).reshape((5,3)),
    columns=['Wind', 'Air_Q', 'Water_Q'],
    index=['La Concha', 'Omaha', 'Copacabana', 'Bondi', 'Waikiki'])
df2

Unnamed: 0,Wind,Air_Q,Water_Q
La Concha,0,1,2
Omaha,3,4,5
Copacabana,6,7,8
Bondi,9,10,11
Waikiki,12,13,14


In [5]:
(df1 + df2) / 2

Unnamed: 0,Air_Q,Temp,Water_Q,Wind
Bondi,,,9.5,7.5
Copacabana,,,6.5,4.5
La Concha,,,,
Omaha,,,3.5,1.5
Waikiki,,,,


The same alignment also happens on assignment:

In [6]:
df2["Noise level"] = pd.Series([32,21], index=["Bondi", "Copacabana"])
df2

Unnamed: 0,Wind,Air_Q,Water_Q,Noise level
La Concha,0,1,2,
Omaha,3,4,5,
Copacabana,6,7,8,21.0
Bondi,9,10,11,32.0
Waikiki,12,13,14,


In [7]:
df2.loc["Bora Bora"] = pd.Series([1,2] , index=["Wind", "Noise level"])
df2

Unnamed: 0,Wind,Air_Q,Water_Q,Noise level
La Concha,0.0,1.0,2.0,
Omaha,3.0,4.0,5.0,
Copacabana,6.0,7.0,8.0,21.0
Bondi,9.0,10.0,11.0,32.0
Waikiki,12.0,13.0,14.0,
Bora Bora,1.0,,,2.0
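
The alignment on assignment applies to labelled objects such as a `Series`; a plain array is assigned by position instead. The sketch below contrasts the two on a small made-up frame (the column names `by_label` and `by_position` are illustrative, not part of the original notebook):

```python
import pandas as pd

# Small illustrative frame; the beach names follow the notebook's examples.
df = pd.DataFrame({"Wind": [3, 6, 9]},
                  index=["Omaha", "Copacabana", "Bondi"])

# Same three values, but labelled in a different order than df's index.
s = pd.Series([1, 2, 3], index=["Bondi", "Copacabana", "Omaha"])

# Assigning the Series aligns on labels: each value goes to its own beach.
df["by_label"] = s

# Assigning the underlying array ignores labels and fills by position.
df["by_position"] = s.to_numpy()

print(df)
```

If the labels matter, assigning the `Series` is the safer choice; converting to an array is only appropriate when the positional order is known to match.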


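Fill values can be specified. When a value is missing from only one of the two operands, `fill_value` is used in its place; positions missing from both operands remain `NaN`: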

In [8]:
df1.add(df2, fill_value=0) / 2

Unnamed: 0,Air_Q,Noise level,Temp,Water_Q,Wind
Bondi,5.0,16.0,3.5,9.5,7.5
Bora Bora,,1.0,,,0.5
Copacabana,3.5,10.5,2.0,6.5,4.5
La Concha,0.5,,,1.0,0.0
Omaha,2.0,,0.5,3.5,1.5
Waikiki,6.5,,,7.0,6.0
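
To make the `fill_value` behaviour easier to see, here is a minimal sketch on two tiny made-up frames (`a` and `b` are illustrative names). It shows that cells present in only one operand are filled before adding, while cells absent from both stay missing:

```python
import pandas as pd

# Two frames with no row or column labels in common.
a = pd.DataFrame({"x": [1.0, 2.0]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10.0]}, index=["r3"])

# The result covers the union of labels: rows r1..r3, columns x and y.
result = a.add(b, fill_value=0)

# (r1, x) exists only in `a`, so it is computed as 1.0 + 0.
# (r1, y) exists in neither frame, so it remains NaN.
print(result)
```

This matches the table above, where for example `La Concha` has no `Temp` value in either DataFrame and therefore still shows as empty.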


### Reindexing

Suppose we query a database for last month's aggregated soft-drink sales and get the following:

In [9]:
sales_month = pd.DataFrame(
    data = {
        'units': [1200, 800, 100],
        'amount': [2300, 1500, 1200]
    }, 
    index=['Coca-Cola', 'Pepsi', 'Fanta'])
sales_month

Unnamed: 0,units,amount
Coca-Cola,1200,2300
Pepsi,800,1500
Fanta,100,1200


Products that sold nothing that month naturally do not appear in the aggregated result. However, we would like them to appear in the final result with sales = 0:

In [10]:
master_products = ['Pepsi', "Gold Cola", "Vodka Kneip", "Fanta"]
sales_month = sales_month.reindex(master_products, fill_value=0)
sales_month

Unnamed: 0,units,amount
Pepsi,800,1500
Gold Cola,0,0
Vodka Kneip,0,0
Fanta,100,1200
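
Note that `reindex` keeps only the labels it is given, which is why `Coca-Cola` (absent from `master_products`) disappears from the output above. If the goal were to add the missing products without dropping any existing one, a possible sketch is to reindex on the union of both label sets (the names `all_products` and `full` are illustrative):

```python
import pandas as pd

sales = pd.DataFrame(
    {"units": [1200, 800, 100], "amount": [2300, 1500, 1200]},
    index=["Coca-Cola", "Pepsi", "Fanta"],
)
master_products = ["Pepsi", "Gold Cola", "Vodka Kneip", "Fanta"]

# Union of the labels already present and the master list.
all_products = sales.index.union(master_products, sort=False)

# Products without sales get 0; products with sales keep their figures.
full = sales.reindex(all_products, fill_value=0)
print(full)
```

Whether dropping or keeping unlisted products is correct depends on which list is considered the source of truth.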


###  Apply

Use `apply` to run a function over a DataFrame column by column or row by row, **when the operation cannot be vectorized**:

In [11]:
df = pd.read_csv('census.csv')[["CTYNAME","POPESTIMATE2010","POPESTIMATE2011"]]
df = df.set_index('CTYNAME')
df.head(10)

Unnamed: 0_level_0,POPESTIMATE2010,POPESTIMATE2011
CTYNAME,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama,4785161,4801108
Autauga County,54660,55253
Baldwin County,183193,186659
Barbour County,27341,27226
Bibb County,22861,22733
Blount County,57373,57711
Bullock County,10887,10629
Butler County,20944,20673
Calhoun County,118437,117768
Chambers County,34098,33993


In [12]:
def min_and_max(estimates):
    return pd.Series([estimates.min(), estimates.max()], index=["min", "max"])    

To apply the function to each column (with `axis='rows'` the function receives one column at a time):

In [13]:
df_col = df.apply(min_and_max, axis='rows')
df_col

Unnamed: 0,POPESTIMATE2010,POPESTIMATE2011
min,83,90
max,37334079,37700034


To apply the function to each row (with `axis='columns'` the function receives one row at a time):

In [14]:
df_row = df.apply(min_and_max, axis='columns')
df_row.head(5)

Unnamed: 0_level_0,min,max
CTYNAME,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama,4785161,4801108
Autauga County,54660,55253
Baldwin County,183193,186659
Barbour County,27226,27341
Bibb County,22733,22861
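
The row-wise minimum and maximum is in fact a case that can be vectorized, so it illustrates the caveat above. The following sketch, built on two rows copied from the output, compares `apply` with the built-in reductions (the names `applied` and `vectorized` are illustrative):

```python
import pandas as pd

# Two rows taken from the census output shown above.
df = pd.DataFrame(
    {"POPESTIMATE2010": [54660, 27341], "POPESTIMATE2011": [55253, 27226]},
    index=["Autauga County", "Barbour County"],
)

# Row-wise apply: the function is called once per row.
applied = df.apply(
    lambda row: pd.Series([row.min(), row.max()], index=["min", "max"]),
    axis="columns",
)

# Vectorized equivalent: one reduction per output column, no Python-level loop.
vectorized = pd.DataFrame({
    "min": df.min(axis="columns"),
    "max": df.max(axis="columns"),
})

print(vectorized)
```

On large frames the vectorized form is usually much faster, because `apply` with `axis='columns'` builds a `Series` for every row.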


### Group by

In [76]:
df = pd.read_csv('census.csv', usecols=["STNAME", "CTYNAME","POPESTIMATE2010","POPESTIMATE2011"])
df.head(10)

Unnamed: 0,STNAME,CTYNAME,POPESTIMATE2010,POPESTIMATE2011
0,Alabama,Alabama,4785161,4801108
1,Alabama,Autauga County,54660,55253
2,Alabama,Baldwin County,183193,186659
3,Alabama,Barbour County,27341,27226
4,Alabama,Bibb County,22861,22733
5,Alabama,Blount County,57373,57711
6,Alabama,Bullock County,10887,10629
7,Alabama,Butler County,20944,20673
8,Alabama,Calhoun County,118437,117768
9,Alabama,Chambers County,34098,33993


In [16]:
df_grouped = df.groupby('STNAME')
df_grouped = df_grouped.agg(
    {'POPESTIMATE2010': [("2010_mean","mean")], 
     'POPESTIMATE2011': [("2011_sum","sum")]})
df_grouped.head(10)

Unnamed: 0_level_0,POPESTIMATE2010,POPESTIMATE2011
Unnamed: 0_level_1,2010_mean,2011_sum
STNAME,Unnamed: 1_level_2,Unnamed: 2_level_2
Alabama,140740.0,9602216
Alaska,47601.4,1445440
Arizona,801026.0,12937464
Arkansas,76905.11,5877076
California,1265562.0,75400068
Colorado,155330.9,10238960
Connecticut,795492.7,7179518
Delaware,449895.5,1815832
District of Columbia,605126.0,1240944
Florida,554408.5,38211066
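
The dictionary of `(name, function)` tuples used above yields two-level column headers. As an alternative sketch, named aggregation produces flat column names directly; the toy data and the output names `mean_2010` and `sum_2011` are assumptions made for illustration:

```python
import pandas as pd

# Toy data with the same column names as the census extract.
df = pd.DataFrame({
    "STNAME": ["A", "A", "B"],
    "POPESTIMATE2010": [10, 30, 5],
    "POPESTIMATE2011": [11, 31, 6],
})

# Named aggregation: output_column=(input_column, function).
summary = df.groupby("STNAME").agg(
    mean_2010=("POPESTIMATE2010", "mean"),
    sum_2011=("POPESTIMATE2011", "sum"),
)
print(summary)
```

One caveat about the census result above: Alabama's `2011_sum` (9602216) is exactly twice the state-level estimate (4801108), because the state summary row is grouped together with its counties. Such rows would need to be filtered out before summing.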


### Merging DataFrames

In [36]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())

                 Role
Name                 
Kelly  Director of HR
Sally  Course liasion
James          Grader

            School
Name              
James     Business
Mike           Law
Sally  Engineering


In [37]:
pd.merge?

In [38]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
James,Grader,Business
Kelly,Director of HR,
Mike,,Law
Sally,Course liasion,Engineering


In [39]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Sally,Course liasion,Engineering
James,Grader,Business


In [40]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Kelly,Director of HR,
Sally,Course liasion,Engineering
James,Grader,Business


In [41]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
James,Grader,Business
Mike,,Law
Sally,Course liasion,Engineering


In [42]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

Unnamed: 0,Name,Role,School
0,Kelly,Director of HR,
1,Sally,Course liasion,Engineering
2,James,Grader,Business


In [43]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

Unnamed: 0,Name,Role,Location_x,School,Location_y
0,Kelly,Director of HR,State Street,,
1,Sally,Course liasion,Washington Avenue,Engineering,512 Wilson Crescent
2,James,Grader,Washington Avenue,Business,1024 Billiard Avenue
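
The `_x` and `_y` suffixes above are the defaults that `pd.merge` adds when both frames share a non-key column. A short sketch of two optional parameters that can make the result easier to read, `suffixes` and `indicator`, on a reduced version of the data (the suffix names are illustrative choices):

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Kelly", "Sally"],
    "Location": ["State Street", "Washington Avenue"],
})
students = pd.DataFrame({
    "Name": ["Sally", "Mike"],
    "Location": ["512 Wilson Crescent", "Fraternity House #22"],
})

merged = pd.merge(
    staff, students,
    how="outer", on="Name",
    suffixes=("_staff", "_student"),  # replaces the default _x / _y
    indicator=True,                   # adds a _merge column with the row's origin
)
print(merged)
```

The `_merge` column takes the values `left_only`, `right_only` or `both`, which is handy for checking how many rows actually matched.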


In [44]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
print(staff_df)
print(student_df)
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])


  First Name   Last Name            Role
0      Kelly  Desjardins  Director of HR
1      Sally      Brooks  Course liasion
2      James       Wilde          Grader
  First Name Last Name       School
0      James   Hammond     Business
1       Mike     Smith          Law
2      Sally    Brooks  Engineering


Unnamed: 0,First Name,Last Name,Role,School
0,Sally,Brooks,Course liasion,Engineering


### The "Category" data type

In [46]:
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+


In [47]:
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

If we also want to express an order among the categories:


In [48]:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                            ordered=True)

In [49]:
grades = df['Grades'].astype(cat_type)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]
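
Once the categories are ordered, comparisons, sorting and `min`/`max` follow the declared order rather than alphabetical order. A minimal sketch on a shortened grade list (the variable names are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
cat_type = CategoricalDtype(categories=order, ordered=True)

grades = pd.Series(['A+', 'B', 'C-', 'D'], dtype=cat_type)

# Comparison uses the category order: only grades strictly better than 'C'.
above_c = grades[grades > 'C']

# Sorting and max also respect the declared order.
ranked = grades.sort_values()
best = grades.max()

print(above_c.tolist(), ranked.tolist(), best)
```

With an unordered categorical, the comparison `grades > 'C'` would raise a `TypeError` instead.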

### Pivot Tables

In [50]:
df = pd.read_csv('cars.csv')

In [51]:
df.head(12)

Unnamed: 0,YEAR,Make,Model,Size,(kW),Unnamed: 5,TYPE,CITY (kWh/100 km),HWY (kWh/100 km),COMB (kWh/100 km),CITY (Le/100 km),HWY (Le/100 km),COMB (Le/100 km),(g/km),RATING,(km),TIME (h)
0,2012,MITSUBISHI,i-MiEV,SUBCOMPACT,49,A1,B,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
1,2012,NISSAN,LEAF,MID-SIZE,80,A1,B,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
2,2013,FORD,FOCUS ELECTRIC,COMPACT,107,A1,B,19.0,21.1,20.0,2.1,2.4,2.2,0,,122,4
3,2013,MITSUBISHI,i-MiEV,SUBCOMPACT,49,A1,B,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
4,2013,NISSAN,LEAF,MID-SIZE,80,A1,B,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
5,2013,SMART,FORTWO ELECTRIC DRIVE CABRIOLET,TWO-SEATER,35,A1,B,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
6,2013,SMART,FORTWO ELECTRIC DRIVE COUPE,TWO-SEATER,35,A1,B,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
7,2013,TESLA,MODEL S (40 kWh battery),FULL-SIZE,270,A1,B,22.4,21.9,22.2,2.5,2.5,2.5,0,,224,6
8,2013,TESLA,MODEL S (60 kWh battery),FULL-SIZE,270,A1,B,22.2,21.7,21.9,2.5,2.4,2.5,0,,335,10
9,2013,TESLA,MODEL S (85 kWh battery),FULL-SIZE,270,A1,B,23.8,23.2,23.6,2.7,2.6,2.6,0,,426,12


In [52]:
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)

Make,BMW,CHEVROLET,FORD,KIA,MITSUBISHI,NISSAN,SMART,TESLA
YEAR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2012,,,,,49.0,80.0,,
2013,,,107.0,,49.0,80.0,35.0,280.0
2014,,104.0,107.0,,49.0,80.0,35.0,268.333333
2015,125.0,104.0,107.0,81.0,49.0,80.0,35.0,320.666667
2016,125.0,104.0,107.0,81.0,49.0,80.0,35.0,409.7


In [53]:
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=[np.mean,np.min], margins=True)

Unnamed: 0_level_0,mean,mean,mean,mean,mean,mean,mean,mean,mean,amin,amin,amin,amin,amin,amin,amin,amin,amin
Make,BMW,CHEVROLET,FORD,KIA,MITSUBISHI,NISSAN,SMART,TESLA,All,BMW,CHEVROLET,FORD,KIA,MITSUBISHI,NISSAN,SMART,TESLA,All
YEAR,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
2012,,,,,49.0,80.0,,,64.5,,,,,49.0,80.0,,,49
2013,,,107.0,,49.0,80.0,35.0,280.0,158.444444,,,107.0,,49.0,80.0,35.0,270.0,35
2014,,104.0,107.0,,49.0,80.0,35.0,268.333333,135.0,,104.0,107.0,,49.0,80.0,35.0,225.0,35
2015,125.0,104.0,107.0,81.0,49.0,80.0,35.0,320.666667,181.428571,125.0,104.0,107.0,81.0,49.0,80.0,35.0,280.0,35
2016,125.0,104.0,107.0,81.0,49.0,80.0,35.0,409.7,252.263158,125.0,104.0,107.0,81.0,49.0,80.0,35.0,283.0,35
All,125.0,104.0,107.0,81.0,49.0,80.0,35.0,345.478261,190.622642,125.0,104.0,107.0,81.0,49.0,80.0,35.0,225.0,35


The inverse operation also exists: `stack` turns the pivoted (wide) table back into a long one:

In [54]:
df_reduced = df[['YEAR', 'Model','(kW)']]
df_reduced

Unnamed: 0,YEAR,Model,(kW)
0,2012,i-MiEV,49
1,2012,LEAF,80
2,2013,FOCUS ELECTRIC,107
3,2013,i-MiEV,49
4,2013,LEAF,80
5,2013,FORTWO ELECTRIC DRIVE CABRIOLET,35
6,2013,FORTWO ELECTRIC DRIVE COUPE,35
7,2013,MODEL S (40 kWh battery),270
8,2013,MODEL S (60 kWh battery),270
9,2013,MODEL S (85 kWh battery),270


In [55]:
df_pivoted = df_reduced.pivot_table(values='(kW)', index='YEAR', columns='Model')
df_pivoted

Model,FOCUS ELECTRIC,FORTWO ELECTRIC DRIVE CABRIOLET,FORTWO ELECTRIC DRIVE COUPE,LEAF,LEAF (24 kWh battery),LEAF (30 kWh battery),MODEL S (40 kWh battery),MODEL S (60 kWh battery),MODEL S (70 kWh battery),MODEL S (85 kWh battery),...,MODEL S 90D (Refresh),MODEL S P85D/P90D,MODEL S P90D (Refresh),MODEL S PERFORMANCE,MODEL X 90D,MODEL X P90D,SOUL EV,SPARK EV,i-MiEV,i3
YEAR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012,,,,80.0,,,,,,,...,,,,,,,,,49.0,
2013,107.0,35.0,35.0,80.0,,,270.0,270.0,,270.0,...,,,,310.0,,,,,49.0,
2014,107.0,35.0,35.0,80.0,,,,225.0,,270.0,...,,,,310.0,,,,104.0,49.0,
2015,107.0,35.0,35.0,80.0,,,,283.0,283.0,,...,,515.0,,,,,81.0,104.0,49.0,125.0
2016,107.0,35.0,35.0,,80.0,80.0,,283.0,283.0,,...,386.0,568.0,568.0,,386.0,568.0,81.0,104.0,49.0,125.0


In [56]:
df_restored = df_pivoted.stack()
df_restored

YEAR  Model                          
2012  LEAF                                80.0
      i-MiEV                              49.0
2013  FOCUS ELECTRIC                     107.0
      FORTWO ELECTRIC DRIVE CABRIOLET     35.0
      FORTWO ELECTRIC DRIVE COUPE         35.0
      LEAF                                80.0
      MODEL S (40 kWh battery)           270.0
      MODEL S (60 kWh battery)           270.0
      MODEL S (85 kWh battery)           270.0
      MODEL S PERFORMANCE                310.0
      i-MiEV                              49.0
2014  FOCUS ELECTRIC                     107.0
      FORTWO ELECTRIC DRIVE CABRIOLET     35.0
      FORTWO ELECTRIC DRIVE COUPE         35.0
      LEAF                                80.0
      MODEL S (60 kWh battery)           225.0
      MODEL S (85 kWh battery)           270.0
      MODEL S PERFORMANCE                310.0
      SPARK EV                           104.0
      i-MiEV                              49.0
2015  FOCUS ELECTRIC  

In [57]:
df_restored.name = "(kW)"

In [58]:
df_restored.reset_index()

Unnamed: 0,YEAR,Model,(kW)
0,2012,LEAF,80.0
1,2012,i-MiEV,49.0
2,2013,FOCUS ELECTRIC,107.0
3,2013,FORTWO ELECTRIC DRIVE CABRIOLET,35.0
4,2013,FORTWO ELECTRIC DRIVE COUPE,35.0
5,2013,LEAF,80.0
6,2013,MODEL S (40 kWh battery),270.0
7,2013,MODEL S (60 kWh battery),270.0
8,2013,MODEL S (85 kWh battery),270.0
9,2013,MODEL S PERFORMANCE,310.0
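
The round trip above can be condensed into a small sketch. It uses three rows from the cars data and adds an explicit `dropna()`, an assumption made here so that empty cells are discarded regardless of how the installed pandas version's `stack` treats them:

```python
import pandas as pd

long_df = pd.DataFrame({
    "YEAR": [2012, 2012, 2013],
    "Model": ["LEAF", "i-MiEV", "LEAF"],
    "(kW)": [80, 49, 80],
})

# Long -> wide: one row per year, one column per model (mean is the default).
wide = long_df.pivot_table(values="(kW)", index="YEAR", columns="Model")

# Wide -> long: stack the columns back into rows and drop the empty cells.
restored = (wide
    .stack()
    .dropna()
    .rename("(kW)")
    .reset_index()
)
print(restored)
```

The restored table is not identical to the original: the values become floats (as in the output above) and the rows come back sorted by year and model.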


###  Method chaining vs. chain indexing

In [62]:
df = pd.read_csv('census.csv')
df

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


Pandas methods generally return a new DataFrame instead of modifying the one they operate on, so calls can be chained. The following two code blocks are equivalent:

In [65]:
df_unchained = df.set_index(['STNAME','CTYNAME'])
df_unchained = df_unchained.rename(columns={'SUMLEV': 'SUMMARYLEVELS'})
df_unchained.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMMARYLEVELS,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Alabama,40,3,6,1,0,4779736,4780127,4785161,4801108,4816089,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411
Alabama,Bullock County,50,3,6,1,11,10914,10915,10887,10629,10606,...,-30.953709,-5.180127,-1.130263,14.35429,-16.167247,-29.001673,-2.825524,1.507017,17.24379,-13.193961
Alabama,Butler County,50,3,6,1,13,20947,20946,20944,20673,20408,...,-14.032727,-11.684234,-5.655413,1.085428,-6.529805,-13.936612,-11.586865,-5.557058,1.184103,-6.430868
Alabama,Calhoun County,50,3,6,1,15,118572,118586,118437,117768,117286,...,-6.15567,-4.611706,-5.524649,-4.463211,-3.376322,-5.791579,-4.092677,-5.062836,-3.912834,-2.806406
Alabama,Chambers County,50,3,6,1,17,34215,34170,34098,33993,34075,...,-2.731639,3.849092,2.872721,-2.287222,1.349468,-1.821092,4.701181,3.781439,-1.290228,2.346901


In [70]:
df_chained = (df
    .set_index(['STNAME','CTYNAME'])
    .rename(columns={'SUMLEV': 'SUMMARYLEVELS'})
    # ...
)
df_chained

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMMARYLEVELS,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Alabama,40,3,6,1,0,4779736,4780127,4785161,4801108,4816089,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,50,4,8,56,37,43806,43806,43593,44041,45104,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
Wyoming,Teton County,50,4,8,56,39,21294,21294,21297,21482,21697,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
Wyoming,Uinta County,50,4,8,56,41,21118,21118,21102,20912,20989,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
Wyoming,Washakie County,50,4,8,56,43,8533,8533,8545,8469,8443,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961
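
A chain can also include computed columns. The sketch below extends the same pattern with `assign` and `sort_values` on two rows taken from the census data; the `growth` column is a hypothetical addition for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama"],
    "CTYNAME": ["Autauga County", "Barbour County"],
    "POPESTIMATE2010": [54660, 27341],
    "POPESTIMATE2011": [55253, 27226],
})

result = (df
    .set_index(["STNAME", "CTYNAME"])
    # The lambda receives the intermediate frame produced by the previous step.
    .assign(growth=lambda d: d["POPESTIMATE2011"] - d["POPESTIMATE2010"])
    .sort_values("growth")
)
print(result)
```

Because every step returns a new object, the original `df` is left without the `growth` column.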


But **be careful**: this is not necessarily true of the indexing operator. Sometimes it returns a copy and sometimes a reference to the original object, **and there is no reliable way to predict which**.

In [91]:
df = (pd
    .read_csv('census.csv', usecols=['STNAME', 'CTYNAME', 'POPESTIMATE2010', 'POPESTIMATE2011'])
    .set_index('STNAME'))
df

Unnamed: 0_level_0,CTYNAME,POPESTIMATE2010,POPESTIMATE2011
STNAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama,Alabama,4785161,4801108
Alabama,Autauga County,54660,55253
Alabama,Baldwin County,183193,186659
Alabama,Barbour County,27341,27226
Alabama,Bibb County,22861,22733
...,...,...,...
Wyoming,Sweetwater County,43593,44041
Wyoming,Teton County,21297,21482
Wyoming,Uinta County,21102,20912
Wyoming,Washakie County,8545,8469


In [93]:
df.loc['Alabama', :]

Unnamed: 0_level_0,CTYNAME,POPESTIMATE2010,POPESTIMATE2011
STNAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama,Alabama,4785161,4801108
Alabama,Autauga County,54660,55253
Alabama,Baldwin County,183193,186659
Alabama,Barbour County,27341,27226
Alabama,Bibb County,22861,22733
...,...,...,...
Alabama,Tuscaloosa County,194977,196638
Alabama,Walker County,67004,66641
Alabama,Washington County,17610,17336
Alabama,Wilcox County,11557,11488


In [94]:
df.loc['Alabama', 'POPESTIMATE2010'] = 0
df

Unnamed: 0_level_0,CTYNAME,POPESTIMATE2010,POPESTIMATE2011
STNAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama,Alabama,0,4801108
Alabama,Autauga County,0,55253
Alabama,Baldwin County,0,186659
Alabama,Barbour County,0,27226
Alabama,Bibb County,0,22733
...,...,...,...
Wyoming,Sweetwater County,43593,44041
Wyoming,Teton County,21297,21482
Wyoming,Uinta County,21102,20912
Wyoming,Washakie County,8545,8469


And yet...

In [95]:
piece_of_df = df.loc['Alabama']
piece_of_df

Unnamed: 0_level_0,CTYNAME,POPESTIMATE2010,POPESTIMATE2011
STNAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama,Alabama,0,4801108
Alabama,Autauga County,0,55253
Alabama,Baldwin County,0,186659
Alabama,Barbour County,0,27226
Alabama,Bibb County,0,22733
...,...,...,...
Alabama,Tuscaloosa County,0,196638
Alabama,Walker County,0,66641
Alabama,Washington County,0,17336
Alabama,Wilcox County,0,11488


In [96]:
piece_of_df['POPESTIMATE2011'] = 1
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0_level_0,CTYNAME,POPESTIMATE2010,POPESTIMATE2011
STNAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama,Alabama,0,1
Alabama,Autauga County,0,1
Alabama,Baldwin County,0,1
Alabama,Barbour County,0,1
Alabama,Bibb County,0,1
...,...,...,...
Wyoming,Sweetwater County,43593,44041
Wyoming,Teton County,21297,21482
Wyoming,Uinta County,21102,20912
Wyoming,Washakie County,8545,8469
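
One common way to sidestep this ambiguity is to be explicit about intent: take a `.copy()` when an independent piece is wanted, and use a single `.loc[rows, columns] = value` call when the original should change. A minimal sketch on a two-row frame (the data is a reduced, illustrative version of the census extract):

```python
import pandas as pd

df = pd.DataFrame(
    {"CTYNAME": ["Autauga County", "Teton County"],
     "POPESTIMATE2011": [55253, 21482]},
    index=pd.Index(["Alabama", "Wyoming"], name="STNAME"),
)

# Explicit copy: changes to `piece` never reach `df`, and no warning is raised.
piece = df.loc[["Alabama"]].copy()
piece["POPESTIMATE2011"] = 1

# Single .loc assignment: unambiguously modifies `df` itself.
df.loc["Wyoming", "POPESTIMATE2011"] = 0

print(df)
print(piece)
```

Here `df` keeps Alabama's original value and only the Wyoming row changes.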
