# More Basics of Pandas

Last session we introduced Pandas quickly. Now we review the basics of using Pandas, with a more interactive workflow to solidify understanding and facility with this powerful Python toolkit.

## Reviewing Pandas Basics

In [6]:
from pandas import DataFrame
import pandas as pd

In [2]:
import pandas as pd
df = pd.read_csv('DEC_10_SF1_P1_with_ann.csv')

### Basics of indexing in Pandas

The first use of indexing is to use a slice, just like we have done with other Python objects. Below we slice the first 5 index values of the first dimension of the dataframe.

In [3]:
df[:5]  #select particular rows, in this case the first five ---> df[0:5]

Unnamed: 0,Id,Id2,Geography,Population
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261


In [6]:
df["Id"] #Prints all rows in the "Id" column

0       1400000US21001970100
1       1400000US21001970200
2       1400000US21001970300
3       1400000US21001970401
4       1400000US21001970402
                ...         
1110    1400000US21239050106
1111    1400000US21239050107
1112    1400000US21239050200
1113    1400000US21239050300
1114    1400000US21239050400
Name: Id, Length: 1115, dtype: object

In [7]:
df["Id"] [:5] #Prints specified rows from "Id" column

0    1400000US21001970100
1    1400000US21001970200
2    1400000US21001970300
3    1400000US21001970401
4    1400000US21001970402
Name: Id, dtype: object

In [12]:
df['Population'][4]   #Finds a specific cell

4261

The slice df[:5] shown at the start of this section is equivalent to using the iloc indexer, which is integer-based and selects purely by position in the index.

In [4]:
df.iloc[:5]

Unnamed: 0,Id,Id2,Geography,Population
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261


A second way to index is with loc, which selects by the labels of the index. Note that a loc slice includes its end label, whereas an iloc slice excludes its end position.

In [13]:
df.loc[:5]

Unnamed: 0,Id,Id2,Geography,Population
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261
5,1400000US21001970500,21001970500,"Census Tract 9705, Adair County, Kentucky",2457


In [18]:
df.loc[5]  #gives the row labeled 5; with the default integer index this matches df.iloc[5]

Id                                 1400000US21001970500
Id2                                         21001970500
Geography     Census Tract 9705, Adair County, Kentucky
Population                                         2457
Name: 5, dtype: object

In [20]:
df.iloc[0]#gives information of row 0

Id                                 1400000US21001970100
Id2                                         21001970100
Geography     Census Tract 9701, Adair County, Kentucky
Population                                         1727
Name: 0, dtype: object

In [24]:
df2=df.drop(1, axis=0)  #drop row 1
df2[:5]

Unnamed: 0,Id,Id2,Geography,Population
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261
5,1400000US21001970500,21001970500,"Census Tract 9705, Adair County, Kentucky",2457


In [25]:
df2.iloc[2]   #row at position 2, which carries label 3 because row 1 was dropped --> be careful, positions and labels no longer match

Id                                    1400000US21001970401
Id2                                            21001970401
Geography     Census Tract 9704.01, Adair County, Kentucky
Population                                            4070
Name: 3, dtype: object

In [26]:
df2.loc[2]     #gives information of the row labeled 2

Id                                 1400000US21001970300
Id2                                         21001970300
Geography     Census Tract 9703, Adair County, Kentucky
Population                                         3016
Name: 2, dtype: object
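
The gap between position and label can be reproduced on a tiny frame. This is a minimal sketch with made-up values (the `toy` frame and its column names are assumptions for illustration, not part of the census file):

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({'tract': ['A', 'B', 'C', 'D'],
                    'population': [10, 20, 30, 40]})

dropped = toy.drop(1, axis=0)        # remove the row labeled 1; labels are now 0, 2, 3

by_position = dropped.iloc[1]        # second row by position, which carries label 2
by_label = dropped.loc[2]            # row whose label is 2

# Label slices include the end label; position slices exclude the end position
n_loc = len(toy.loc[:2])             # labels 0, 1, 2
n_iloc = len(toy.iloc[:2])           # positions 0, 1

# reset_index(drop=True) renumbers the labels so loc and iloc agree again
renumbered = dropped.reset_index(drop=True)
print(by_position['tract'], by_label['tract'], n_loc, n_iloc)
```

If you want positions and labels to line up again after dropping rows, renumbering with reset_index(drop=True) is one option.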

Note that indexing can work for both rows and columns.

In [29]:
df.loc[:5,  :'Geography']  #the colon before 'Geography' selects all columns up to and including Geography

Unnamed: 0,Id,Id2,Geography
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky"
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky"
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky"
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky"
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky"
5,1400000US21001970500,21001970500,"Census Tract 9705, Adair County, Kentucky"


In [30]:
df.iloc[:5, :3]

Unnamed: 0,Id,Id2,Geography
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky"
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky"
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky"
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky"
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky"


We can select rows based on their value as well.  Notice that we nest df[df[condition]] to get this result.

In [34]:
df['Population']<100   #tests every row and returns a Boolean Series: True where the condition holds

0       False
1       False
2       False
3       False
4       False
        ...  
1110    False
1111    False
1112    False
1113    False
1114    False
Name: Population, Length: 1115, dtype: bool

In [35]:
df[df['Population'] < 200]  #to return the matching rows of the data frame, put the condition inside df[...]

Unnamed: 0,Id,Id2,Geography,Population
64,1400000US21015980100,21015980100,"Census Tract 9801, Boone County, Kentucky",0
124,1400000US21029980100,21029980100,"Census Tract 9801, Bullitt County, Kentucky",0
203,1400000US21047980100,21047980100,"Census Tract 9801, Christian County, Kentucky",0
255,1400000US21061980100,21061980100,"Census Tract 9801, Edmonson County, Kentucky",8
678,1400000US21111980100,21111980100,"Census Tract 9801, Jefferson County, Kentucky",0
803,1400000US21143980100,21143980100,"Census Tract 9801, Lyon County, Kentucky",0
878,1400000US21163980100,21163980100,"Census Tract 9801, Meade County, Kentucky",9
1053,1400000US21221980100,21221980100,"Census Tract 9801, Trigg County, Kentucky",18
1054,1400000US21221980200,21221980200,"Census Tract 9802, Trigg County, Kentucky",6


In [36]:
df[df['Id2']==21015980100]

Unnamed: 0,Id,Id2,Geography,Population
64,1400000US21015980100,21015980100,"Census Tract 9801, Boone County, Kentucky",0


In [45]:
df[(df["Population"]>100) & (df["Population"]<200)]   # each condition needs its own parentheses, joined by & inside the brackets --> here no rows satisfy both conditions

Unnamed: 0,Id,Id2,Geography,Population


In [51]:
df3=df[(df["Population"]>100) & (df["Population"]<600)]   #assign to a new data frame
#add .copy() to create an independent frame that can be modified without affecting df

In [52]:
df3 #call the new data frame

Unnamed: 0,Id,Id2,Geography,Population
325,1400000US21067003916,21067003916,"Census Tract 39.16, Fayette County, Kentucky",501.0
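
The pattern of combining conditions and then copying the result can be sketched on a small frame. The values below are invented for illustration (the `toy` frame is an assumption, not the census data):

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                    'population': [0, 150, 501, 4000]})

# & means "and", | means "or", ~ means "not"; each condition gets its own parentheses
both = toy[(toy['population'] > 100) & (toy['population'] < 600)]
either = toy[(toy['population'] < 100) | (toy['population'] > 1000)]
negated = toy[~(toy['population'] > 100)]

# .copy() gives an independent frame, so changing it leaves toy untouched
subset = both.copy()
subset['population'] = subset['population'] * 2
print(subset)
```

Python's own `and`, `or`, and `not` do not work element-wise on a Series, which is why the symbols are used instead.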


Here we show how to set a value of a cell in the table, identifying a specific row by index label, and setting its population, in this case to a None value, which Pandas interprets as a NaN (missing value).

In [54]:
df.loc[688,'Population'] = None #change the value of the cell

We can filter for values that are Null

In [55]:
df[df['Population'].isnull()] #test the change

Unnamed: 0,Id,Id2,Geography,Population
688,1400000US21115960100,21115960100,"Census Tract 9601, Johnson County, Kentucky",


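Or, more commonly, we filter out the null values with notnull(), as in cell In [58] below. First, though, note that we can add a new column by assigning a value to a column name that does not yet exist.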

In [56]:
df["NewCol"]=0  #create a new column and assign zeros to all the rows
df

Unnamed: 0,Id,Id2,Geography,Population,NewCol
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727.0,0
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722.0,0
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016.0,0
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070.0,0
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261.0,0
...,...,...,...,...,...
1110,1400000US21239050106,21239050106,"Census Tract 501.06, Woodford County, Kentucky",3261.0,0
1111,1400000US21239050107,21239050107,"Census Tract 501.07, Woodford County, Kentucky",3757.0,0
1112,1400000US21239050200,21239050200,"Census Tract 502, Woodford County, Kentucky",3533.0,0
1113,1400000US21239050300,21239050300,"Census Tract 503, Woodford County, Kentucky",1899.0,0


In [58]:
df[df['Population'].notnull()]

Unnamed: 0,Id,Id2,Geography,Population,NewCol
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727.0,0
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722.0,0
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016.0,0
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070.0,0
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261.0,0
...,...,...,...,...,...
1110,1400000US21239050106,21239050106,"Census Tract 501.06, Woodford County, Kentucky",3261.0,0
1111,1400000US21239050107,21239050107,"Census Tract 501.07, Woodford County, Kentucky",3757.0,0
1112,1400000US21239050200,21239050200,"Census Tract 502, Woodford County, Kentucky",3533.0,0
1113,1400000US21239050300,21239050300,"Census Tract 503, Woodford County, Kentucky",1899.0,0
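
The missing-value steps above can be tried on a small frame. This is a sketch under assumptions: the `toy` frame is invented, and dropna() is shown as an alternative to the notnull() filter rather than as the notebook's method:

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({'tract': ['A', 'B', 'C'],
                    'population': [100.0, 200.0, 300.0]})

toy.loc[1, 'population'] = None                 # stored as NaN in a float column

n_missing = toy['population'].isnull().sum()    # count the missing values
kept = toy[toy['population'].notnull()]         # Boolean filter, as in the text
also_kept = toy.dropna(subset=['population'])   # same rows, using dropna
print(n_missing)
print(kept)
```

Both approaches return a new frame and leave `toy` itself unchanged.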


Here we find and print records that are in Fayette County, using the str attribute and 'contains' to search for the county name in the Geography column.

In [59]:
df[df['Geography'].str.contains('Fayette County')]  #give me all rows that have Fayette County in the Geography column

Unnamed: 0,Id,Id2,Geography,Population,NewCol
262,1400000US21067000101,21067000101,"Census Tract 1.01, Fayette County, Kentucky",3072.0,0
263,1400000US21067000102,21067000102,"Census Tract 1.02, Fayette County, Kentucky",1567.0,0
264,1400000US21067000200,21067000200,"Census Tract 2, Fayette County, Kentucky",3563.0,0
265,1400000US21067000300,21067000300,"Census Tract 3, Fayette County, Kentucky",3157.0,0
266,1400000US21067000400,21067000400,"Census Tract 4, Fayette County, Kentucky",1261.0,0
...,...,...,...,...,...
339,1400000US21067004205,21067004205,"Census Tract 42.05, Fayette County, Kentucky",6628.0,0
340,1400000US21067004207,21067004207,"Census Tract 42.07, Fayette County, Kentucky",3855.0,0
341,1400000US21067004208,21067004208,"Census Tract 42.08, Fayette County, Kentucky",7097.0,0
342,1400000US21067004209,21067004209,"Census Tract 42.09, Fayette County, Kentucky",3916.0,0


We can find the unique values of a column (not very interesting in this particular case)

We saw last time how to use the str attribute to do string manipulation, such as to create a new column.  We need to explore some more advanced string processing, but let's use a smaller dataframe for that.

In [60]:
#add a column by splitting the Geography column on commas: element 0 is the census tract, element 1 is the county, element 2 is the state
df['state'] = df['Geography'].str.split(',').str[2]   #take element 2 state and put it in the state column
df[:5]

Unnamed: 0,Id,Id2,Geography,Population,NewCol,state
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727.0,0,Kentucky
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722.0,0,Kentucky
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016.0,0,Kentucky
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070.0,0,Kentucky
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261.0,0,Kentucky


In [64]:
df['county'] = df['Geography'].str.split(',').str[1]   #similarly assigning a county
df[:5]

Unnamed: 0,Id,Id2,Geography,Population,NewCol,state,county,tract
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727.0,0,Kentucky,Adair County,Census Tract 9701
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722.0,0,Kentucky,Adair County,Census Tract 9702
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016.0,0,Kentucky,Adair County,Census Tract 9703
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070.0,0,Kentucky,Adair County,Census Tract 9704.01
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261.0,0,Kentucky,Adair County,Census Tract 9704.02


In [65]:
df['tract'] = df['Geography'].str.split(',').str[0]   #similarly assigning a tract column
df[:5]

Unnamed: 0,Id,Id2,Geography,Population,NewCol,state,county,tract
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727.0,0,Kentucky,Adair County,Census Tract 9701
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722.0,0,Kentucky,Adair County,Census Tract 9702
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016.0,0,Kentucky,Adair County,Census Tract 9703
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070.0,0,Kentucky,Adair County,Census Tract 9704.01
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261.0,0,Kentucky,Adair County,Census Tract 9704.02
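
Splitting on ',' leaves a leading space on the county and state pieces (it is visible in the unique() output further down). A minimal sketch of one way to split once and trim the pieces, using two Geography strings copied from the output above; the `toy` and `parts` names are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({'Geography': [
    'Census Tract 9701, Adair County, Kentucky',
    'Census Tract 1.01, Fayette County, Kentucky',
]})

# expand=True returns a DataFrame with one column per piece (named 0, 1, 2)
parts = toy['Geography'].str.split(',', expand=True)

# str.strip() removes the space left behind after each comma
toy['tract'] = parts[0].str.strip()
toy['county'] = parts[1].str.strip()
toy['state'] = parts[2].str.strip()
print(toy[['tract', 'county', 'state']])
```

Trimming matters later on: without it, a comparison such as county == 'Adair County' would not match ' Adair County'.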


In [66]:
df['county'].unique()   #list the items in the county column

array([' Adair County', ' Allen County', ' Anderson County',
       ' Ballard County', ' Barren County', ' Bath County',
       ' Bell County', ' Boone County', ' Bourbon County', ' Boyd County',
       ' Boyle County', ' Bracken County', ' Breathitt County',
       ' Breckinridge County', ' Bullitt County', ' Butler County',
       ' Caldwell County', ' Calloway County', ' Campbell County',
       ' Carlisle County', ' Carroll County', ' Carter County',
       ' Casey County', ' Christian County', ' Clark County',
       ' Clay County', ' Clinton County', ' Crittenden County',
       ' Cumberland County', ' Daviess County', ' Edmonson County',
       ' Elliott County', ' Estill County', ' Fayette County',
       ' Fleming County', ' Floyd County', ' Franklin County',
       ' Fulton County', ' Gallatin County', ' Garrard County',
       ' Grant County', ' Graves County', ' Grayson County',
       ' Green County', ' Greenup County', ' Hancock County',
       ' Hardin County', ' Harlan 

And count how many times each unique value is in the data

In [67]:
df['county'].value_counts()  #count how often each unique value occurs in the county column, in descending order

 Jefferson County     191
 Fayette County        82
 Kenton County         41
 Campbell County       25
 Warren County         24
                     ... 
 Trimble County         2
 Wolfe County           2
 Cumberland County      2
 Hickman County         1
 Robertson County       1
Name: county, Length: 120, dtype: int64

In [68]:
df.columns

Index(['Id', 'Id2', 'Geography', 'Population', 'NewCol', 'state', 'county',
       'tract'],
      dtype='object')

In [69]:
col_order=['Id','Id2','tract','county']
df[col_order]

Unnamed: 0,Id,Id2,tract,county
0,1400000US21001970100,21001970100,Census Tract 9701,Adair County
1,1400000US21001970200,21001970200,Census Tract 9702,Adair County
2,1400000US21001970300,21001970300,Census Tract 9703,Adair County
3,1400000US21001970401,21001970401,Census Tract 9704.01,Adair County
4,1400000US21001970402,21001970402,Census Tract 9704.02,Adair County
...,...,...,...,...
1110,1400000US21239050106,21239050106,Census Tract 501.06,Woodford County
1111,1400000US21239050107,21239050107,Census Tract 501.07,Woodford County
1112,1400000US21239050200,21239050200,Census Tract 502,Woodford County
1113,1400000US21239050300,21239050300,Census Tract 503,Woodford County


In [79]:
new_cols=df.columns.insert(3,'NewCol2')

In [81]:
df["NewCol2"]=0  #check with this
df

Unnamed: 0,Id,Id2,Geography,Population,NewCol,state,county,tract,NewCol2
0,1400000US21001970100,21001970100,"Census Tract 9701, Adair County, Kentucky",1727.0,0,Kentucky,Adair County,Census Tract 9701,0
1,1400000US21001970200,21001970200,"Census Tract 9702, Adair County, Kentucky",1722.0,0,Kentucky,Adair County,Census Tract 9702,0
2,1400000US21001970300,21001970300,"Census Tract 9703, Adair County, Kentucky",3016.0,0,Kentucky,Adair County,Census Tract 9703,0
3,1400000US21001970401,21001970401,"Census Tract 9704.01, Adair County, Kentucky",4070.0,0,Kentucky,Adair County,Census Tract 9704.01,0
4,1400000US21001970402,21001970402,"Census Tract 9704.02, Adair County, Kentucky",4261.0,0,Kentucky,Adair County,Census Tract 9704.02,0
...,...,...,...,...,...,...,...,...,...
1110,1400000US21239050106,21239050106,"Census Tract 501.06, Woodford County, Kentucky",3261.0,0,Kentucky,Woodford County,Census Tract 501.06,0
1111,1400000US21239050107,21239050107,"Census Tract 501.07, Woodford County, Kentucky",3757.0,0,Kentucky,Woodford County,Census Tract 501.07,0
1112,1400000US21239050200,21239050200,"Census Tract 502, Woodford County, Kentucky",3533.0,0,Kentucky,Woodford County,Census Tract 502,0
1113,1400000US21239050300,21239050300,"Census Tract 503, Woodford County, Kentucky",1899.0,0,Kentucky,Woodford County,Census Tract 503,0
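
Note that df.columns.insert() only builds a new Index; it does not change the DataFrame, which is why NewCol2 ends up as the last column above. The sketch below shows two ways to control column position, on an invented `toy` frame (the names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({'Id': [1, 2],
                    'Geography': ['x', 'y'],
                    'Population': [10, 20]})

# Index.insert returns a new Index and leaves the frame as it was
new_cols = toy.columns.insert(0, 'Extra')
still_missing = 'Extra' not in toy.columns

# DataFrame.insert places a new column at a given position, in place
toy.insert(1, 'NewCol2', 0)

# Selecting with a list of names returns the columns in that order
reordered = toy[['Id', 'Population', 'Geography', 'NewCol2']]
print(list(toy.columns))
print(list(reordered.columns))
```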


### More complex use of Indexing and String Manipulation -- Cleaning the Bedroom Field in the Craigslist data

In [82]:
import pandas as pd, numpy as np
cl = pd.read_csv('items.csv')
cl.head()

Unnamed: 0,neighborhood,title,price,bedrooms,pid,longitude,date,link,latitude,sqft,sourcepage
0,(SOMA / south beach),"1bed + Den, 1bath at Mission Bay",$2895,/ 1br - 950ft² -,4046628359,-122.399663,Sep 4 2013,/sfc/apa/4046628359.html,37.774623,/ 1br - 950ft² -,http://sfbay.craigslist.org/sfc/apa/
1,(SOMA / south beach),Love where you live!,$3354,/ 1br - 710ft² -,4046761563,,Sep 4 2013,/sfc/apa/4046761563.html,,/ 1br - 710ft² -,http://sfbay.craigslist.org/sfc/apa/
2,(inner sunset / UCSF),We Welcome Your Furry Friends! Call Today!,$2865,/ 1br - 644ft² -,4046661504,-122.470727,Sep 4 2013,/sfc/apa/4046661504.html,37.765739,/ 1br - 644ft² -,http://sfbay.craigslist.org/sfc/apa/
3,(financial district),Golden Gateway Commons | 2BR + office townhous...,$5500,/ 2br - 1450ft² -,4036170429,,Sep 4 2013,/sfc/apa/4036170429.html,,/ 2br - 1450ft² -,http://sfbay.craigslist.org/sfc/apa/
4,(lower nob hill),Experience Luxury Living in San Fransisco!,$3892,/ 2br -,4046732678,,Sep 4 2013,/sfc/apa/4046732678.html,,/ 2br -,http://sfbay.craigslist.org/sfc/apa/


This is a worked part of your assignment for this week.  Here is a code snippet that uses some of what we just learned above, and extends it to clean the bedroom field in this data.  Adapting it to clean sqft remains for you to do in the assignment...

Below are two different ways to do this.  The first involves looping over the index of the dataframe, and finding the beginning and end of the substring we are looking for, to isolate the bedrooms value.

In [98]:
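'/ 1br - 950 ft² -'.find('br')

3

In [100]:
int('/ 1br - 950 ft² -'[1,3])   #[1,3] is a tuple index, not a slice, so this raises the error below; a slice needs a colon, e.g. [2:3]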

TypeError: string indices must be integers

In [87]:
for label in cl['bedrooms'].index:
    if isinstance(cl['bedrooms'][label], str) and not pd.isnull(cl['bedrooms'][label]): #only process non-null strings
        end = cl['bedrooms'][label].find('br')  #position where 'br' starts, or -1 if it is absent
        if end == -1:
            cl.loc[label,'bedrooms'] = np.nan
        else:
            start = cl['bedrooms'][label].find('/ ') + 2 #skip past the forward slash and the space after it
            cl.loc[label,'bedrooms'] = int(cl['bedrooms'][label][start: end])
cl['bedrooms'][:9]

0      1
1      1
2      1
3      2
4      2
5      1
6    NaN
7      1
8      2
Name: bedrooms, dtype: object

Here is a second way to do this that is much more elegant, and in some ways much simpler, since it does not require looping over the index values. It uses the apply() method to apply a function element-wise. This is covered briefly on page 133 of Python for Data Analysis, and more extensively in the online documentation.

In [102]:
# first we create a function that works on a single value
def clean_br(value):
    ''' value - string from Craigslist entry that includes number of bedrooms
        returns - integer number of bedrooms
    '''
    if isinstance(value, str):
        end = value.find('br')
        if end == -1:    #str.find returns -1 when the substring does not exist, so there is no bedroom count to extract
            return None
        else:
            start = value.find('/') + 2
            return int(value[start:end])
        

In [103]:
# for example, we can apply it to a single string
clean_br('/ 1br - 950ft² -')

1

In [104]:
# and another
clean_br('/ 2br - 1450ft² -')

2

In [158]:
# really what we want to do is to apply it to all rows in a dataframe
# we can do this with a loop, similar to what we have above
cl = pd.read_csv('items.csv') #read the CSV file into cl

for i, row in cl[:5].iterrows(): 
    #row['bedrooms'] = clean_br(row['bedrooms'])
    print (i)
    print (row["bedrooms"])

0
   / 1br - 950ft² -    
1
   / 1br - 710ft² -    
2
   / 1br - 644ft² -    
3
   / 2br - 1450ft² -    
4
   / 2br -    


In [159]:
# now we apply clean_br to every row in the dataframe,
# writing each cleaned value back through cl.loc
cl = pd.read_csv('items.csv') #read the CSV file into cl

for i, row in cl.iterrows(): 
   cl.loc[i,'bedrooms'] = clean_br(row['bedrooms'])
    

In [160]:
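# Let's look at the result.  Assigning to row['bedrooms'] inside the loop (the commented-out line in In [158])
# leaves cl unchanged, because iterrows() generally yields a copy of each row; writing through cl.loc[i, 'bedrooms'] does change it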
cl['bedrooms'][:9]

0       1
1       1
2       1
3       2
4       2
5       1
6    None
7       1
8       2
Name: bedrooms, dtype: object
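
The difference between assigning to `row` and assigning through `.loc` can be checked on a two-row frame. This is a sketch with invented values; the `toy` frame stands in for the Craigslist data and is an assumption:

```python
import pandas as pd

# Hypothetical toy data standing in for the Craigslist file
toy = pd.DataFrame({'bedrooms': ['/ 1br -', '/ 2br -'],
                    'price': [2895, 5500]})

# Attempt 1: assign to the row object handed out by iterrows()
for i, row in toy.iterrows():
    row['bedrooms'] = 'changed'          # modifies a copy of the row only
after_row_assignment = toy['bedrooms'].tolist()

# Attempt 2: write back into the frame by label
for i, row in toy.iterrows():
    toy.loc[i, 'bedrooms'] = 'changed'
after_loc_assignment = toy['bedrooms'].tolist()

print(after_row_assignment)
print(after_loc_assignment)
```

With columns of mixed types, as here, each `row` is a freshly built Series, so changes to it never reach the frame.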

In [161]:
# a second option is to use the apply() method.  
# apply() takes a function as an argument, and calls it on every value in the column
# the looping is implicit

cl = pd.read_csv('items.csv')  
cl['bedrooms'] = cl['bedrooms'].apply(clean_br) #assigning back to the column makes the change explicit
cl['bedrooms'][:9]

0    1.0
1    1.0
2    1.0
3    2.0
4    2.0
5    1.0
6    NaN
7    1.0
8    2.0
Name: bedrooms, dtype: float64
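
A third option, which is not the approach used in this notebook, is to let a regular expression pull out the digits with the str.extract() method. The sketch below assumes the bedroom count always appears as digits immediately before 'br'; the sample strings are modeled on the values printed above, and `raw` is a hypothetical stand-in for cl['bedrooms']:

```python
import pandas as pd

# Hypothetical sample modeled on the bedrooms column, including a missing entry
raw = pd.Series(['/ 1br - 950ft² -', '/ 2br -', '/ 644ft² -', None])

# (\d+)br captures one or more digits directly before 'br';
# expand=False returns a Series, with NaN where there is no match
bedrooms = raw.str.extract(r'(\d+)br', expand=False).astype(float)
print(bedrooms)
```

The same idea might be adapted to other fields, but the pattern would need to be checked against the real data first.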

In [163]:
# if we have a very simple function, we can define it all in one line using the keyword lambda

cl['bedrooms_plus2'] = cl['bedrooms'].apply(lambda x : x+2)
cl['bedrooms_plus2'][:9]


0    3.0
1    3.0
2    3.0
3    4.0
4    4.0
5    3.0
6    NaN
7    3.0
8    4.0
Name: bedrooms_plus2, dtype: float64

In [162]:
# the above line is equivalent to:
def plus_two(x):
    return x+2

cl['bedrooms_plus2'] = cl['bedrooms'].apply(plus_two)
cl['bedrooms_plus2'][:9]

0    3.0
1    3.0
2    3.0
3    4.0
4    4.0
5    3.0
6    NaN
7    3.0
8    4.0
Name: bedrooms_plus2, dtype: float64
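
For simple arithmetic like this, apply() is not strictly needed, because pandas arithmetic already works element-wise. A small sketch with made-up values (the `bedrooms` Series here is an assumption, not the Craigslist column):

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry
bedrooms = pd.Series([1.0, 2.0, np.nan])

via_apply = bedrooms.apply(lambda x: x + 2)   # calls the function once per value
vectorized = bedrooms + 2                     # same result, computed on the whole column
print(vectorized)
```

apply() remains useful when the per-value logic does not map onto a built-in operation, as with clean_br.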

## Working with Pandas -- an in-class working session

OK, let's create a DataFrame from a dictionary, following the example on pg 116 of Python for Data Analysis (PDA).

In [9]:
from pandas import DataFrame
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = DataFrame(data)

Explain the contents and structure of 'data'

What does 'DataFrame(data)' do? What if we did not begin that line with 'df ='?

Look at the contents of df, using just df by itself, and 'print(df)'.

In [10]:
df

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


We can refer to a column in two ways:

In [11]:
df['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [12]:
df.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

We can step through the rows in a dataframe

In [13]:
for label in df.state.index:
    print(df.state[label])

Ohio
Ohio
Ohio
Nevada
Nevada


And find the position of a specific substring within each entry (find() returns -1 when the substring is absent)

In [14]:
for label in df.state.index:
    print(df.state[label].find('io'))

2
2
2
-1
-1


In [15]:
for label in df.state.index:
    if df.state[label]=='Ohio':
        print(df.state[label])
    else:
        print('Missing')

Ohio
Ohio
Ohio
Missing
Missing
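
The three loops above can also be written without an explicit loop. This is a minimal sketch of the same operations using the str accessor and where(); the `states` Series simply repeats the state column from the example:

```python
import pandas as pd

states = pd.Series(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'])

# str.find works on every entry at once; -1 means the substring is absent
positions = states.str.find('io')

# where() keeps values where the condition is True and substitutes elsewhere
labels = states.where(states == 'Ohio', 'Missing')

print(positions.tolist())
print(labels.tolist())
```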


## Now Your Turn

Below are a series of questions, with the answers remaining for you to fill in by using pandas expressions that draw on the methods in Chapter 5. You should not need anything beyond the content of that chapter, a subset of the methods summarized above, to do this exercise. Hopefully you can complete it within class if you've been keeping up with the reading.

How can we get a quick statistical profile of all the numeric columns?

In [16]:
df.describe()

Unnamed: 0,year,pop
count,5.0,5.0
mean,2001.2,2.42
std,0.83666,0.864292
min,2000.0,1.5
25%,2001.0,1.7
50%,2001.0,2.4
75%,2002.0,2.9
max,2002.0,3.6


Can you get a profile of a column that is not numeric, like state? Try it.

In [17]:
#Yes
df['state'].describe()

count        5
unique       2
top       Ohio
freq         3
Name: state, dtype: object

How can we print the data types of each column?

In [18]:
df.dtypes

state     object
year       int64
pop      float64
dtype: object

How can we print just the column containing state names?

In [22]:
#df.state
df.loc[:,'state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

How can we get a list of the states in the DataFrame, without duplicates?

In [24]:

f=df['state'].unique()  #returns a NumPy array of the unique values, not a Series
print(f)

['Ohio' 'Nevada']


How can we get a count of how many rows we have in each state?

In [26]:
#df['state'].count()
df['state'].value_counts()

Ohio      3
Nevada    2
Name: state, dtype: int64

How can we compute the mean of population across all the rows?

In [27]:
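df.mean(numeric_only=True)   #numeric_only=True skips the text column 'state'; use df['pop'].mean() for population alone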

year    2001.20
pop        2.42
dtype: float64

How can we compute the maximum population across all the rows?

In [30]:
df['pop'].max()

3.6

How can we compute the 20th percentile value of population? 

In [34]:
df['pop'].quantile(0.2)
#df['pop'].quantile(q=0.2)    #q is optional
#df.quantile(0.2)    #for all columns

1.6600000000000001

How can we compute a Boolean array indicating whether the state is 'Ohio'?

In [35]:
df["state"]=='Ohio'

0     True
1     True
2     True
3    False
4    False
Name: state, dtype: bool

How can we select and print just the rows for Ohio?

In [38]:
#df[(df)['state'].str.contains('Ohio')]
df[df['state']=='Ohio']

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6


How can we create a new DataFrame containing only the Ohio records?

In [41]:
#df2=df[(df)['state'].str.contains('Ohio')]
#or
df2=df[df['state']=='Ohio']
df2

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6


How can we select and print just the rows in which population is more than 2?

In [42]:
df[df['pop'] > 2] 

Unnamed: 0,state,year,pop
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


How could we compute the mean of population that is in Ohio, averaging across years?

In [46]:
#df2.mean()
df[df['state']=='Ohio'][['pop']].mean()

pop    2.266667
dtype: float64
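
Selecting the rows and the column in a single .loc call gives the same mean as a single number. The sketch below also previews groupby(), which goes beyond the Chapter 5 methods this exercise asks for, so treat that line as optional; the frame is rebuilt from the dictionary used earlier in this session:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)

# Boolean row selection and column selection in one step
ohio_mean = df.loc[df['state'] == 'Ohio', 'pop'].mean()

# Preview: the mean for every state at once
by_state = df.groupby('state')['pop'].mean()

print(ohio_mean)
print(by_state)
```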

How can we print the DataFrame, sorted by State and within State, by Population?

In [48]:
df.sort_values(['state','pop'])   #first state then pop ---> order of sorting

Unnamed: 0,state,year,pop
3,Nevada,2001,2.4
4,Nevada,2002,2.9
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6


How can we print the row for Ohio, 2002, selecting on its values (not on row and column indexes)?

In [49]:
df[(df['state']=="Ohio") & (df['year']==2002)]

Unnamed: 0,state,year,pop
2,Ohio,2002,3.6


How can we use row and column indexing to set the population of Ohio in 2002 to 3.4?

In [50]:
df.loc[2,'pop'] = 3.4 #change the value of the cell

How can we use row and column indexing to append a new record for Utah, initially with no population or year? 

In [51]:
df.loc[5,"state"]="Utah"  #add a new row with this
df

Unnamed: 0,state,year,pop
0,Ohio,2000.0,1.5
1,Ohio,2001.0,1.7
2,Ohio,2002.0,3.4
3,Nevada,2001.0,2.4
4,Nevada,2002.0,2.9
5,Utah,,


How can we set the population to 2.5 and year to 2001 for the new record?

In [52]:
df.loc[5,"year"]=2001  #add year
df.loc[5,"pop"]=2.5  #add population
df

Unnamed: 0,state,year,pop
0,Ohio,2000.0,1.5
1,Ohio,2001.0,1.7
2,Ohio,2002.0,3.4
3,Nevada,2001.0,2.4
4,Nevada,2002.0,2.9
5,Utah,2001.0,2.5
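
Notice that year became a float column above, because the new row briefly held a missing year. If the whole record is known up front, one alternative (not the row-and-column indexing the question asks for) is to build it as a one-row frame and combine the two with pd.concat(). A sketch on a shortened version of the frame; `new_row` and `combined` are names chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                   'year': [2000, 2001],
                   'pop': [1.5, 2.4]})

# A complete one-row frame for the new record
new_row = pd.DataFrame({'state': ['Utah'], 'year': [2001], 'pop': [2.5]})

# ignore_index=True renumbers the rows 0, 1, 2
combined = pd.concat([df, new_row], ignore_index=True)
print(combined)
print(combined.dtypes)
```

Because no missing value is ever introduced, year stays an integer column.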
