# Data Cleaning with Pandas

In [2]:
import pandas as pd

## Scenario

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

When working on a project involving data that can fit on our computer, we store it in a `data` directory.

```bash
cd <project_directory>  # example: cd ~/flatiron_ds/pandas-3
mkdir data
cd data
```

Note that `<project_directory>` in angle brackets is a _placeholder_. You should type the path to the actual location on your computer where you're working on this project. Do not literally type `<project_directory>` and _do not type the angle brackets_. You can see an example in the _comment_ to the right of the command above.

Now, we'll need to download the two data files that we need. We can do this at the command line:

```bash
wget https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
wget https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip
```

*Note:* If you do not have the `wget` command yet, you can install it on macOS with `brew install wget`, or use `curl <url> -o <filename>` instead.

Note that `%20` in a URL translates into a space. Even though you will *never put spaces in filenames*, you may need to deal with spaces that _other_ people have used in filenames.

There are two ways to handle the spaces in these filenames when referencing them at the command line.

#### 1. You can _escape_ the spaces by putting a backslash (`\`, remember _backslash is next to backspace_) before each one:

`unzip Real\ Property\ Sales.zip`

This is what happens if you tab-complete the filename in the terminal. Tab completion is your friend!

#### 2. You can put the entire filename in quotes:

`unzip "Real Property Sales.zip"`

Try unzipping these files with the `unzip` command. The `unzip` command takes one argument, the name of the file that you want to unzip.

In [28]:
sales_df = pd.read_csv('EXTR_RPSale.csv')

### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.
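
One way to sidestep mixed types is to tell pandas up front how to read the affected columns. The sketch below is a minimal illustration on a tiny in-memory CSV. The sample rows are made up for the example and are not taken from `EXTR_RPSale.csv`, and it assumes that reading `Major` and `Minor` as strings is acceptable as a first step.

```python
import io

import pandas as pd

# Made-up sample: the second row has a whitespace-only Major value,
# the kind of entry that can leave a column with mixed types.
csv_text = "ExciseTaxNbr,Major,Minor\n1,138860,110\n2,      ,40\n3,423943,50\n"

# Read the parcel-number columns as strings so every value has one type.
sample_df = pd.read_csv(io.StringIO(csv_text), dtype={"Major": str, "Minor": str})

major_values = sample_df["Major"].tolist()
print(major_values)
```

With every value read as a string, the conversion to numbers becomes an explicit later step rather than something pandas guesses chunk by chunk.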

In [29]:
sales_df.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,2687551,138860,110,08/21/2014,245000,20140828001436,,,,,...,3,6,3,N,N,N,N,1,8,
1,1235111,664885,40,07/09/1991,0,199203161090,71.0,1.0,664885.0,C,...,3,0,26,N,N,N,N,18,3,11
2,2704079,423943,50,10/11/2014,0,20141205000558,,,,,...,3,6,15,N,N,N,N,18,8,18 31 51
3,2584094,403700,715,01/04/2013,0,20130110000910,,,,,...,3,6,15,N,N,N,N,11,8,18 31 38
4,1056831,951120,900,04/20/1989,85000,198904260448,117.0,53.0,951120.0,P,...,3,0,2,N,N,N,N,1,9,49


### Data overload?

That's a lot of columns. We're only interested in identifying the date, sale price, and square footage of each specific property. What can we do?

In [69]:
small_sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']].copy()

In [70]:
small_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2026164 entries, 0 to 2026163
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 61.8+ MB


In [27]:
sales_df.DocumentDate.describe()

count        2026164
unique         14257
top       06/11/2014
freq             951
Name: DocumentDate, dtype: object

In [18]:
sales_df.Major.describe()

count     2026164
unique      28590
top             0
freq         9362
Name: Major, dtype: int64

In [9]:
bldg_df = pd.read_csv('EXTR_ResBldg.csv')

  interactivity=interactivity, compiler=compiler, result=result)


### Another warning! Which column has index 11?

In [19]:
bldg_df.ZipCode.describe()

count     468423
unique       292
top        98115
freq       11579
Name: ZipCode, dtype: object

In [13]:
bldg_df.columns[11]

'ZipCode'

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [11]:
bldg_df.head().T

Unnamed: 0,0,1,2,3,4
Major,62304,62304,62304,62306,62306
Minor,9347,9378,9394,9034,9036
BldgNbr,1,1,1,1,1
NbrLivingUnits,1,1,1,1,1
Address,907 SW 104TH ST 98146,10215 11TH AVE SW 98146,510 SW 108TH ST 98146,18809 SE 109TH ST 98027,11128 RENTON-ISSAQUAH RD SE 98027
BuildingNumber,907,10215,510,18809,11128
Fraction,,,,,
DirectionPrefix,SW,,SW,SE,
StreetName,104TH,11TH,108TH,109TH,RENTON-ISSAQUAH
StreetType,ST,AVE,ST,ST,RD


### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [67]:
small_bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']].copy()

In [68]:
small_bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 512114 entries, 0 to 512113
Data columns (total 4 columns):
Major            512114 non-null int64
Minor            512114 non-null int64
SqFtTotLiving    512114 non-null int64
ZipCode          468423 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB


In [71]:
sales_data = pd.merge(small_sales_df, small_bldg_df, on=['Major', 'Minor'])

In [73]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


In [74]:
sales_data.Major.describe()

count    1.446234e+06
mean     4.481679e+05
std      2.870434e+05
min      4.000000e+01
25%      2.018700e+05
50%      3.832500e+05
75%      7.228500e+05
max      9.906000e+05
Name: Major, dtype: float64

In [75]:
sales_data.Minor.describe()

count    1.446234e+06
mean     1.603629e+03
std      2.885767e+03
min      1.000000e+00
25%      1.200000e+02
50%      3.350000e+02
75%      1.060000e+03
max      9.692000e+03
Name: Minor, dtype: float64

In [40]:
sales_data.ZipCode.describe()

count     1286267
unique        282
top         98042
freq        32844
Name: ZipCode, dtype: object

In [43]:
sales_data.Major.dtype

dtype('O')

### Error!

Why are we seeing an error when we try to join the dataframes?

<table>
    <tr>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013160 entries, 0 to 2013159
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.4+ MB</pre></td>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511359 entries, 0 to 511358
Data columns (total 4 columns):
Major            511359 non-null int64
Minor            511359 non-null int64
SqFtTotLiving    511359 non-null int64
ZipCode          468345 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB
</pre></td>
    </tr>
</table>

Review the error message in light of the above:

* `ValueError: You are trying to merge on object and int64 columns.`
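
The mismatch can be reproduced on a small scale. The following sketch uses two made-up frames (hypothetical values, not the real tables) whose key columns have different dtypes, and inspects the dtypes before attempting the merge.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
left = pd.DataFrame({"Major": ["138860", "423943"], "SalePrice": [245000, 0]})
right = pd.DataFrame({"Major": [138860, 423943], "SqFtTotLiving": [1490, 960]})

# Compare the key column dtypes side by side.
print(left["Major"].dtype, right["Major"].dtype)

try:
    pd.merge(left, right, on="Major")
    merge_error = None
except ValueError as err:
    # pandas refuses to merge a string key with an integer key.
    merge_error = str(err)

print(merge_error)
```

Checking the key dtypes first usually points straight at the column that needs converting.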

In [76]:
sales_df['Major'] = pd.to_numeric(sales_df['Major'])

### Error!

Note the useful error message above:

`ValueError: Unable to parse string "      " at position 936643`

In this case, we want to treat non-numeric values as missing values. Let's see if there's a way to change how the `pd.to_numeric` function handles errors.

In [4]:
# The single question mark means "show me the docstring"
pd.to_numeric?

Here's the part that we're looking for:
```
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input
```

Let's try setting the `errors` parameter to `'coerce'`.

In [84]:
small_sales_df['Major'] = pd.to_numeric(small_sales_df['Major'], errors='coerce')

Did it work?

In [86]:
small_sales_df.Major.dtype

dtype('float64')
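
The `float64` result is expected: `NaN` is a floating-point value, so a column that contains coerced missing values cannot remain a plain integer column. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values, including a whitespace-only entry like the one in the error message.
raw = pd.Series(["138860", "      ", "423943"])

converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64, because NaN forces a float column
print(converted.isna().sum())  # number of values that could not be parsed
```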

In [87]:
small_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2026164 entries, 0 to 2026163
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 61.8+ MB


It worked! Let's do the same thing with the `Minor` parcel number.

In [81]:
small_sales_df['Minor'] = pd.to_numeric(small_sales_df['Minor'], errors='coerce')

In [88]:
small_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2026164 entries, 0 to 2026163
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 61.8+ MB


Now, let's try our join again.

In [122]:
sales_data = pd.merge(small_sales_df, small_bldg_df, on=['Major', 'Minor'])

In [123]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


In [124]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1446234 entries, 0 to 1446233
Data columns (total 6 columns):
Major            1446234 non-null float64
Minor            1446234 non-null float64
DocumentDate     1446234 non-null object
SalePrice        1446234 non-null int64
SqFtTotLiving    1446234 non-null int64
ZipCode          1329081 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 77.2+ MB


We can see right away that we're missing zip codes for many of the sales transactions: the 1329081 non-null entries for `ZipCode` are fewer than the 1446234 entries in the dataframe.

In [125]:
sales_data.loc[sales_data['ZipCode'].isna()].head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
91,334330.0,1343.0,05/30/2006,0,4600,
92,334330.0,1343.0,05/30/2006,0,4600,
93,334330.0,1343.0,11/26/2001,0,4600,
94,334330.0,1343.0,05/30/2006,0,4600,
95,334330.0,1343.0,06/30/2016,0,4600,


Because we are interested in finding houses in Seattle zip codes, we will need to drop the rows with missing zip codes.

In [126]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]

sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1329081 entries, 0 to 1446233
Data columns (total 6 columns):
Major            1329081 non-null float64
Minor            1329081 non-null float64
DocumentDate     1329081 non-null object
SalePrice        1329081 non-null int64
SqFtTotLiving    1329081 non-null int64
ZipCode          1329081 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 71.0+ MB
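
An equivalent way to drop those rows is `DataFrame.dropna` with the `subset` argument. The sketch below uses a small made-up frame rather than the real `sales_data`.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "SalePrice": [245000, 0, 85000],
        "ZipCode": ["98002", None, "98115"],
    }
)

# Keep only rows where ZipCode is present; other columns are not checked.
with_zip = demo.dropna(subset=["ZipCode"])

# Same result as the boolean-mask form with ~isna().
masked = demo.loc[~demo["ZipCode"].isna(), :]

print(with_zip.equals(masked))
```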


# Your turn: Data Cleaning with Pandas

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

In [127]:
sales_data = sales_data.loc[~sales_data['SalePrice'].isna(), :]
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1329081 entries, 0 to 1446233
Data columns (total 6 columns):
Major            1329081 non-null float64
Minor            1329081 non-null float64
DocumentDate     1329081 non-null object
SalePrice        1329081 non-null int64
SqFtTotLiving    1329081 non-null int64
ZipCode          1329081 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 71.0+ MB


In [129]:
sales_data1 = sales_data.loc[sales_data['SalePrice'] != 0, :].copy()
sales_data1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 872118 entries, 0 to 1446233
Data columns (total 6 columns):
Major            872118 non-null float64
Minor            872118 non-null float64
DocumentDate     872118 non-null object
SalePrice        872118 non-null int64
SqFtTotLiving    872118 non-null int64
ZipCode          872118 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 46.6+ MB


In [132]:
sales_data2 = sales_data1.loc[~sales_data1['SqFtTotLiving'].isna(), :].copy()
sales_data2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 872118 entries, 0 to 1446233
Data columns (total 6 columns):
Major            872118 non-null float64
Minor            872118 non-null float64
DocumentDate     872118 non-null object
SalePrice        872118 non-null int64
SqFtTotLiving    872118 non-null int64
ZipCode          872118 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 46.6+ MB


In [133]:
sales_data3 = sales_data2.loc[sales_data2['SqFtTotLiving'] != 0, :].copy()
sales_data3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 872106 entries, 0 to 1446233
Data columns (total 6 columns):
Major            872106 non-null float64
Minor            872106 non-null float64
DocumentDate     872106 non-null object
SalePrice        872106 non-null int64
SqFtTotLiving    872106 non-null int64
ZipCode          872106 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 46.6+ MB
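
The four filtering steps above can also be written as a single filter. This is a sketch on a made-up frame; `valid_rows` and `cleaned` are names introduced here for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "SalePrice": [245000, 0, 85000, 120000],
        "SqFtTotLiving": [1490, 960, 0, 1200],
    }
)

# A row is valid when both values are present and non-zero.
# Each comparison is parenthesized because & binds tighter than !=.
valid_rows = (
    demo["SalePrice"].notna()
    & (demo["SalePrice"] != 0)
    & demo["SqFtTotLiving"].notna()
    & (demo["SqFtTotLiving"] != 0)
)

cleaned = demo.loc[valid_rows].copy()
print(cleaned)
```

Building the mask from one frame and applying it to that same frame avoids any reliance on index alignment between intermediate copies.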


### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [136]:
a = 'string'
a[0]
b = []
b.append(a[0])
b

['s']

In [149]:
a=list("98199-3014")
a[0]
len(a)

10

In [143]:
del(a[5:10])
a

['9', '8', '1', '9', '9']

In [145]:
s = "".join(a)
s

'98199'

In [102]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
def is_integer(x):
    try:
        _ = int(x)
    except ValueError:
        return False
    return True

sales_data.loc[~sales_data['ZipCode'].apply(is_integer), 'ZipCode'].head()

14502    98199-3014
14503    98199-3014
14504    98199-3014
23823    98031-3173
24355    98042-3001
Name: ZipCode, dtype: object

In [147]:
sales_data['ZipCode'][14502]

'98199-3014'

In [158]:
a='98199'
str(a)

'98199'

In [None]:
def short(x):
    # Keep only the first five characters, so a ZIP+4 code such as
    # '98199-3014' becomes '98199'. Shorter values are returned unchanged.
    return str(x)[:5]

In [266]:
p = '98132-0000'
short(p)

'98132'

In [164]:
sales_data3['ZipCode'] = sales_data3['ZipCode'].map(short)
sales_data3['ZipCode'][14502]

'98199'
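
Because Python strings support slicing, the same shortening can be done without converting to a list. The sketch below shows the slice on a single string and, as one alternative to `map`, the vectorized `.str` accessor on a made-up Series.

```python
import pandas as pd

# Slicing keeps the first five characters and leaves shorter strings unchanged.
print("98199-3014"[:5])  # 98199

zips = pd.Series(["98199-3014", "98002", "98031-3173"])

# .str[:5] applies the same slice to every element.
short_zips = zips.str[:5]
print(short_zips.tolist())
```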

In [165]:
sales_data3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 872106 entries, 0 to 1446233
Data columns (total 6 columns):
Major            872106 non-null float64
Minor            872106 non-null float64
DocumentDate     872106 non-null object
SalePrice        872106 non-null int64
SqFtTotLiving    872106 non-null int64
ZipCode          872106 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 86.6+ MB


### 3. Add a column for PricePerSqFt



In [207]:
sales_data3['PricePerSqFt'] = sales_data3['SalePrice']/sales_data3['SqFtTotLiving']
sales_data3['PricePerSqFt'] 

0           164.429530
1            73.355705
2             9.855034
6           100.000000
7           132.812500
11          260.955056
12          200.561798
13           79.775281
14          296.629213
16           68.268293
17          214.390244
19          151.583710
20          167.375566
21           87.307692
22          167.375566
23           48.642534
25          102.916667
26           94.093137
27          142.132353
29          177.551020
31          137.755102
32          214.285714
33          151.020408
34           89.201878
36          158.685446
38          118.577075
39          154.150198
40         1166.666667
42          119.444444
44          150.000000
              ...     
1446190     227.174030
1446193      55.184049
1446194     140.625000
1446195     384.615385
1446197     167.465753
1446198      37.373737
1446199      43.216080
1446200      73.732719
1446201     116.477273
1446202     103.658537
1446203     185.095420
1446204      56.666667
1446208    

In [168]:
sales_data3['SalePrice'][0]

245000

In [169]:
sales_data3['SqFtTotLiving'][0]

1490

In [171]:
245000/1490

164.4295302013423

### 4. Subset the data to 2019 sales only.

We can assume that the DocumentDate is approximately the sale date.

In [239]:
sales_data3['DATE']=pd.to_datetime(sales_data3['DocumentDate'])

In [243]:
sales_data3['DATE']

0         2014-08-21
1         1989-06-12
2         2005-01-16
6         1999-07-15
7         2001-01-08
11        2013-07-03
12        2013-02-21
13        1995-10-13
14        2007-02-22
16        1994-03-23
17        2017-03-29
19        2012-06-07
20        2013-08-26
21        1994-06-27
22        2013-08-22
23        1993-11-10
25        1998-03-06
26        1995-08-01
27        2004-09-13
29        2001-04-09
31        1996-10-31
32        2004-04-30
33        2012-12-21
34        2015-07-14
36        2015-12-21
38        1997-02-21
39        2000-11-28
40        2004-05-10
42        2013-10-24
44        2004-09-20
             ...    
1446190   2014-08-05
1446193   1986-12-11
1446194   2003-07-03
1446195   2016-11-04
1446197   2002-10-07
1446198   1986-12-12
1446199   1988-03-09
1446200   1997-11-10
1446201   2003-05-06
1446202   1994-07-25
1446203   2005-04-15
1446204   1984-06-01
1446208   2003-10-06
1446209   1986-06-25
1446210   1988-12-23
1446214   1999-12-22
1446216   198

In [244]:
sales_data3['Year'] = sales_data3['DATE'].dt.year

In [245]:
sales_data3['Year'].head()

0    2014
1    1989
2    2005
6    1999
7    2001
Name: Year, dtype: int64
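
The `DocumentDate` strings shown earlier follow a month/day/year pattern. Assuming that holds for every row, passing an explicit `format` to `pd.to_datetime` removes any guessing about day and month order. A sketch on a small made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"DocumentDate": ["08/21/2014", "05/16/2019", "01/29/2019"]})

# %m/%d/%Y matches strings such as 08/21/2014.
demo["DATE"] = pd.to_datetime(demo["DocumentDate"], format="%m/%d/%Y")
demo["Year"] = demo["DATE"].dt.year

# Keep only the rows whose year is 2019.
sales_2019 = demo.loc[demo["Year"] == 2019]
print(sales_2019)
```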

In [246]:
sales_data3

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,seattle,DATE,Year,Seattle
0,138860.0,110.0,08/21/2014,245000,1490,98002,164.429530,False,2014-08-21,2014,
1,138860.0,110.0,06/12/1989,109300,1490,98002,73.355705,False,1989-06-12,1989,
2,138860.0,110.0,01/16/2005,14684,1490,98002,9.855034,False,2005-01-16,2005,
6,423943.0,50.0,07/15/1999,96000,960,98092,100.000000,False,1999-07-15,1999,
7,423943.0,50.0,01/08/2001,127500,960,98092,132.812500,False,2001-01-08,2001,
11,403700.0,715.0,07/03/2013,464500,1780,98008,260.955056,False,2013-07-03,2013,
12,403700.0,715.0,02/21/2013,357000,1780,98008,200.561798,False,2013-02-21,2013,
13,403700.0,715.0,10/13/1995,142000,1780,98008,79.775281,False,1995-10-13,1995,
14,403700.0,715.0,02/22/2007,528000,1780,98008,296.629213,False,2007-02-22,2007,
16,98400.0,380.0,03/23/1994,139950,2050,98058,68.268293,False,1994-03-23,1994,


In [247]:
sales_data3['Year'].value_counts()

2004    42525
2005    41320
2003    40401
1998    36795
1999    35537
2002    34911
1997    34753
2006    34625
2000    32889
2001    32357
1994    31091
1992    30616
1996    30300
1993    30052
2016    28359
2017    28010
2015    27976
2013    27339
1995    26963
2007    26483
2014    25887
2018    24773
2012    22821
2011    17680
2008    16899
2010    16630
2009    16118
1989    10133
1991     8985
2019     8351
        ...  
1984     4990
1983     4311
1982     2537
1981       99
1980       10
1979       10
1978       10
1975        9
1977        8
1964        7
1976        7
1966        5
1965        5
1974        4
1971        4
1970        4
1967        4
1973        4
1962        3
1972        3
1969        3
1968        3
1959        1
1954        1
1955        1
2021        1
1960        1
1961        1
1963        1
1934        1
Name: Year, Length: 65, dtype: int64

In [248]:
sales_data3.loc[sales_data3['Year']== 2019]

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,seattle,DATE,Year,Seattle
379,785996.0,190.0,05/16/2019,808000,3000,98072,269.333333,False,2019-05-16,2019,
557,367890.0,30.0,03/13/2019,923000,2550,98144,361.960784,False,2019-03-13,2019,True
646,214090.0,110.0,03/06/2019,390000,1780,98042,219.101124,False,2019-03-06,2019,
905,945920.0,125.0,01/29/2019,1750000,1400,98118,1250.000000,False,2019-01-29,2019,True
1084,121800.0,175.0,03/26/2019,556500,1460,98166,381.164384,False,2019-03-26,2019,True
1211,145360.0,2285.0,05/02/2019,690000,1910,98125,361.256545,False,2019-05-02,2019,True
1246,350160.0,125.0,02/06/2019,935000,2460,98117,380.081301,False,2019-02-06,2019,True
1345,4000.0,228.0,04/26/2019,369000,1560,98168,236.538462,False,2019-04-26,2019,True
1463,259772.0,70.0,03/22/2019,525000,2570,98058,204.280156,False,2019-03-22,2019,
1476,85900.0,115.0,03/21/2019,563750,1690,98106,333.579882,False,2019-03-21,2019,True


### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

In [197]:
sales_data3['ZipCode'][34]

'98188'

In [179]:
sales_data3['ZipCode'][0]   

'98002'

In [None]:
ziplist = list(sales_data3['ZipCode'])
unique_zips = list(set(ziplist))  # named unique_zips to avoid shadowing the built-in zip()
unique_zips

zip1 = list(filter(lambda x: len(x) == 5, unique_zips))

In [188]:
zip1.sort()
int(zip1[0])

28028

In [223]:
zip2 = list(filter(lambda x: int(x)>98100, zip1))
zip2

zip3 = list(filter(lambda x: int(x)<98200, zip2))
zip3

['98101',
 '98102',
 '98103',
 '98104',
 '98105',
 '98106',
 '98107',
 '98108',
 '98109',
 '98111',
 '98112',
 '98113',
 '98115',
 '98116',
 '98117',
 '98118',
 '98119',
 '98121',
 '98122',
 '98125',
 '98126',
 '98132',
 '98133',
 '98134',
 '98136',
 '98144',
 '98146',
 '98148',
 '98155',
 '98157',
 '98166',
 '98168',
 '98176',
 '98177',
 '98178',
 '98188',
 '98189',
 '98198',
 '98199']

In [228]:
#for i in zip3:
sales_data3['ZipCode'][34]== zip3[-4]

True

In [249]:
sales_data3['Seattle']= False

def seattle(zipcode):
    for i in zip3:
        if zipcode == i:
            return True

sales_data3['Seattle']= sales_data3['ZipCode'].map(seattle)
sales_data3['Seattle']

0          None
1          None
2          None
6          None
7          None
11         None
12         None
13         None
14         None
16         None
17         None
19         None
20         None
21         None
22         None
23         None
25         None
26         None
27         None
29         None
31         None
32         None
33         None
34         True
36         True
38         None
39         None
40         None
42         None
44         None
           ... 
1446190    None
1446193    True
1446194    None
1446195    True
1446197    True
1446198    True
1446199    None
1446200    True
1446201    None
1446202    None
1446203    None
1446204    None
1446208    None
1446209    None
1446210    None
1446214    True
1446216    None
1446217    True
1446219    True
1446220    True
1446221    True
1446223    True
1446224    None
1446225    True
1446226    None
1446227    True
1446228    None
1446229    True
1446232    None
1446233    None
Name: Seattle, Length: 8

In [251]:
sales_data3['Seattle'].value_counts()

True    342918
Name: Seattle, dtype: int64
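
Note that `value_counts()` reports only `True`: the `seattle` function returns `True` on a match but falls through without a `return` otherwise, so non-matching rows get `None` rather than `False`. One possible alternative is `Series.isin`, which yields a proper boolean column. The sketch uses a made-up Series and a shortened, illustrative zip list, not the full `zip3`.

```python
import pandas as pd

zips = pd.Series(["98002", "98188", "98115", "98092"])

# Illustrative subset only; substitute the real list of Seattle zip codes.
seattle_zips = ["98115", "98188"]

# isin returns True or False for every row, never None.
in_seattle = zips.isin(seattle_zips)
print(in_seattle.value_counts())
```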

In [252]:
sales_data3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 872106 entries, 0 to 1446233
Data columns (total 11 columns):
Major            872106 non-null float64
Minor            872106 non-null float64
DocumentDate     872106 non-null object
SalePrice        872106 non-null int64
SqFtTotLiving    872106 non-null int64
ZipCode          872106 non-null object
PricePerSqFt     872106 non-null float64
seattle          872106 non-null bool
DATE             872106 non-null datetime64[ns]
Year             872106 non-null int64
Seattle          342918 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(3), object(3)
memory usage: 114.0+ MB


In [253]:
sales_data3.loc[sales_data3['Seattle']==True]

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt,seattle,DATE,Year,Seattle
34,638580.0,110.0,07/14/2015,190000,2130,98188,89.201878,False,2015-07-14,2015,True
36,638580.0,110.0,12/21/2015,338000,2130,98188,158.685446,False,2015-12-21,2015,True
177,510140.0,4256.0,08/01/2003,235000,870,98115,270.114943,False,2003-08-01,2003,True
178,510140.0,4256.0,03/20/1996,135000,870,98115,155.172414,False,1996-03-20,1996,True
180,510140.0,4256.0,07/09/2013,275000,870,98115,316.091954,False,2013-07-09,2013,True
190,8400.0,127.0,03/05/1984,41000,2610,98146,15.708812,False,1984-03-05,1984,True
191,162304.0,9355.0,05/13/2005,237000,1490,98168,159.060403,False,2005-05-13,2005,True
194,162304.0,9355.0,06/20/2002,179000,1490,98168,120.134228,False,2002-06-20,2002,True
202,397170.0,951.0,11/13/1997,137500,1430,98155,96.153846,False,1997-11-13,1997,True
203,397170.0,951.0,05/07/1996,155000,1430,98155,108.391608,False,1996-05-07,1996,True


In [202]:
sales_data3.loc[sales_data3['ZipCode']== zip3[0]]

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
1409492,84400.0,371.0,08/08/2014,110000,1510,98101,72.847682


In [219]:
sales_data3.loc[sales_data3['ZipCode']== zip3[1]] 

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

### 6. What is the mean price per square foot for a house sold in Seattle in 2019?

Don't just type the answer. Type code that generates the answer as output!
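
One possible approach is to combine two boolean conditions and take the mean of the matching rows. In the sketch below the frame and its values are made up, and the comparison is parenthesized because `&` has higher precedence than `==`.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Seattle": [True, True, False, True],
        "Year": [2019, 2019, 2019, 2014],
        "PricePerSqFt": [300.0, 400.0, 200.0, 150.0],
    }
)

# Seattle is already boolean; the Year comparison needs its own parentheses.
seattle_2019 = demo.loc[demo["Seattle"] & (demo["Year"] == 2019)]

mean_price = seattle_2019["PricePerSqFt"].mean()
print(mean_price)  # 350.0
```

This assumes the `Seattle` column holds real booleans; a column containing `None` would need to be cleaned first.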

In [264]:

grouped = sales_data3.groupby(['Seattle', 'Year'])['PricePerSqFt'].mean()
grouped


#sales_data3.loc[(sales_data3['Seattle'] == True) & (sales_data3['Year'] == 2019)]

Seattle  Year
True     1954      0.434783
         1955     11.585366
         1959      5.861244
         1961      4.661638
         1962      9.673289
         1963     11.073826
         1964      6.875179
         1965     14.076087
         1966      7.145402
         1967     16.000000
         1968     11.965148
         1969      8.552632
         1970      9.200445
         1971      7.934889
         1972      1.804124
         1973      6.927880
         1974     12.376238
         1975     15.011249
         1976      7.843137
         1977     16.298701
         1978     19.931177
         1979     23.016768
         1980      5.924751
         1981     54.173482
         1982     49.379959
         1983     51.548805
         1984     50.918482
         1985     52.276541
         1986     53.747360
         1987     55.031941
                    ...    
         1991     96.579571
         1992    100.254853
         1993     93.421019
         1994     95.595331
      