# Data Cleaning with Pandas

In [1]:
import pandas as pd
import numpy as np

### Changing data

#### DataFrame.applymap() and Series.map()

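The `.applymap()` method takes a function as input and applies it to every entry in the DataFrame. In pandas 2.1 and later, `DataFrame.applymap()` is deprecated in favor of `DataFrame.map()`, which does the same thing.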

In [2]:
import pandas as pd

uci = pd.read_csv('data/heart.csv')

In [3]:
def successor(x):
    return x + 1

In [4]:
uci.applymap(successor).head()
#apply calls a function once per column (or once per row with axis=1); map calls it on each entry of a Series
#applymap calls it on each entry of a dataframe

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2


The `.map()` method takes a function as input that it will then apply to every entry in the Series.

In [5]:
uci['age'].map(successor).tail(10)
#uci['age'] = uci['age'].map(successor)
#^ assigning the result back would replace/update the column data
#the code that was run only returns a new Series and doesn't affect the dataframe

293    68
294    45
295    64
296    64
297    60
298    58
299    46
300    69
301    58
302    58
Name: age, dtype: int64
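To make the difference between these methods concrete, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its columns are illustrative assumptions, not part of the heart dataset). It avoids `applymap` itself, since that name is deprecated in recent pandas, and reaches every entry by mapping over each column instead.

```python
import pandas as pd

def successor(x):
    return x + 1

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Series.map: call the function on each entry of one column
a_plus_one = toy['a'].map(successor)

# DataFrame.apply: call the function once per column (axis=0 is the default)
column_ranges = toy.apply(lambda col: col.max() - col.min())

# Element-wise over the whole frame: map every column in turn
all_plus_one = toy.apply(lambda col: col.map(successor))

# None of the calls above changed toy; assign the result back to keep it
toy['a'] = toy['a'].map(successor)
```

The last line is the pattern to use when the change should persist in the dataframe.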

In [6]:
new_output = _
#single underscore references most recent output
new_output

293    68
294    45
295    64
296    64
297    60
298    58
299    46
300    69
301    58
302    58
Name: age, dtype: int64

In [7]:
type(_)
#type of most recent output

pandas.core.series.Series

In [10]:
type(Out[7])

type

In [11]:
type(Out[6])

pandas.core.series.Series

#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [None]:
def round_it(x, y):
    return round(x + y)

In [None]:
uci['oldpeak'].map(lambda x,y: round(x+y))[:4]
#Series.map passes one value at a time, so this two-argument lambda raises a TypeError

In [None]:
def round_it(x):
    return round(x)
#equivalent to the lambda below: a named function versus an anonymous one

In [12]:
uci['oldpeak'].map(lambda x: round(x))[:4]

0    2
1    4
2    1
3    1
Name: oldpeak, dtype: int64
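A small sketch comparing a named function with its lambda equivalent, on made-up values (the `peaks` and `offsets` Series are hypothetical). Because `Series.map` hands the function a single value, a computation that needs two inputs requires a different tool; `Series.combine` is one option, shown here as an assumption about what the two-argument version was aiming for.

```python
import pandas as pd

# Hypothetical values, used only for illustration
peaks = pd.Series([2.3, 3.5, 1.4, 0.8])
offsets = pd.Series([1.0, 1.2, 2.0, 2.0])

def round_it(x):
    return round(x)

# A named function and an anonymous one give the same result
named = peaks.map(round_it)
anonymous = peaks.map(lambda x: round(x))

# Two inputs per entry: combine pairs up the values of both Series
combined = peaks.combine(offsets, lambda x, y: round(x + y))
```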

Exercise: Use an anonymous function to turn the entries in `age` into strings.

In [14]:
uci.age

0      63
1      37
2      41
3      56
4      57
5      57
6      56
7      44
8      52
9      57
10     54
11     48
12     49
13     64
14     58
15     50
16     58
17     66
18     43
19     69
20     59
21     44
22     42
23     61
24     40
25     71
26     59
27     51
28     65
29     53
       ..
273    58
274    47
275    52
276    58
277    57
278    58
279    61
280    42
281    52
282    59
283    40
284    61
285    46
286    59
287    57
288    57
289    55
290    61
291    58
292    58
293    67
294    44
295    63
296    63
297    59
298    57
299    45
300    68
301    57
302    57
Name: age, Length: 303, dtype: int64

In [15]:
uci.age.map(lambda x: str(x))

0      63
1      37
2      41
3      56
4      57
5      57
6      56
7      44
8      52
9      57
10     54
11     48
12     49
13     64
14     58
15     50
16     58
17     66
18     43
19     69
20     59
21     44
22     42
23     61
24     40
25     71
26     59
27     51
28     65
29     53
       ..
273    58
274    47
275    52
276    58
277    57
278    58
279    61
280    42
281    52
282    59
283    40
284    61
285    46
286    59
287    57
288    57
289    55
290    61
291    58
292    58
293    67
294    44
295    63
296    63
297    59
298    57
299    45
300    68
301    57
302    57
Name: age, Length: 303, dtype: object

In [None]:
uci.age.map(lambda x: 9/5*x - 32)
#example of basic arithmetic using a lambda (anonymous) function
#a lambda is limited to a single expression; define a named function for multi-statement logic
#for iterative logic, define a function and use a for loop inside it

## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.
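As a rough sketch of the idea, on made-up data (the `patients` frame is hypothetical): calling `.groupby()` only records how to split the rows, and nothing is computed until an aggregation is applied to the groups.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
patients = pd.DataFrame({
    'sex': [0, 1, 1, 0, 1],
    'chol': [200, 240, 260, 220, 250],
})

# No computation yet: this is just a recipe for splitting the rows
grouped = patients.groupby('sex')

# Aggregations are what produce results, one row per group
group_sizes = grouped.size()
mean_chol = grouped['chol'].mean()
summary = grouped['chol'].agg(['mean', 'max'])
```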

In [18]:
uci.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d010860>

In [19]:
_.boxplot

<bound method boxplot_frame_groupby of <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d010860>>

#### `.groups` and `.get_group()`

In [24]:
uci['sex_str'] = uci.sex.map(lambda x: 'M' if x == 1 else 'F')

In [25]:
uci.groupby('sex_str').groups.keys()

dict_keys(['F', 'M'])

In [20]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [21]:
uci.groupby('sex').groups.keys()

dict_keys([0, 1])

In [22]:
uci.groupby('target').groups.keys()

dict_keys([0, 1])

In [23]:
uci.groupby('sex').get_group(0)  # .tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1
15,50,0,2,120,219,0,1,158,0,1.6,1,0,2,1
16,58,0,2,120,340,0,1,172,0,0.0,2,0,2,1
17,66,0,3,150,226,0,1,114,0,2.6,0,0,2,1
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2,1


### Aggregating

In [26]:
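uci.groupby('sex').std()
#on pandas 2.0+, a non-numeric column such as sex_str makes this raise a TypeError; pass numeric_only=True there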

Unnamed: 0_level_0,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,9.409396,0.972427,19.311119,65.088946,0.332455,0.55715,20.047969,0.422503,1.119844,0.593736,0.881026,0.44129,0.435286
1,8.883803,1.059064,16.658246,42.782392,0.366955,0.510754,24.130882,0.484505,1.174632,0.627378,1.074082,0.659949,0.498626


In [27]:
uci.groupby('sex_str').std()

Unnamed: 0_level_0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
sex_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
F,9.409396,0.0,0.972427,19.311119,65.088946,0.332455,0.55715,20.047969,0.422503,1.119844,0.593736,0.881026,0.44129,0.435286
M,8.883803,0.0,1.059064,16.658246,42.782392,0.366955,0.510754,24.130882,0.484505,1.174632,0.627378,1.074082,0.659949,0.498626


In [28]:
uci.groupby('sex').mean()

Unnamed: 0_level_0,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,55.677083,1.041667,133.083333,261.302083,0.125,0.572917,151.125,0.229167,0.876042,1.427083,0.552083,2.125,0.75
1,53.758454,0.932367,130.94686,239.289855,0.15942,0.507246,148.961353,0.371981,1.115459,1.386473,0.811594,2.400966,0.449275


Exercise: Tell me the average cholesterol level for those with heart disease.

In [29]:
uci.groupby('target').get_group(1).chol.mean()

242.23030303030302
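There is more than one way to get this number. The sketch below, on made-up data (the `patients` frame is hypothetical), shows the `get_group` route next to a boolean mask and a grouped column mean; the last one returns the average for every group at once.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
patients = pd.DataFrame({
    'target': [1, 0, 1, 0, 1],
    'chol': [230, 250, 240, 270, 250],
})

# Pull out one group, then aggregate
via_get_group = patients.groupby('target').get_group(1)['chol'].mean()

# Filter with a boolean mask, then aggregate
via_mask = patients.loc[patients['target'] == 1, 'chol'].mean()

# Aggregate every group in one call
by_target = patients.groupby('target')['chol'].mean()
```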

In [31]:
uci.groupby('target').get_group(1).chol.describe()

count    165.000000
mean     242.230303
std       53.552872
min      126.000000
25%      208.000000
50%      234.000000
75%      267.000000
max      564.000000
Name: chol, dtype: float64

In [32]:
uci.groupby('target').get_group(1).chol.sort_values()

111    126
53     141
151    149
162    157
94     160
9      168
164    175
163    175
27     175
31     177
35     177
136    178
149    180
58     182
65     183
62     186
157    192
5      192
117    193
102    195
104    196
128    196
29     197
155    197
144    197
87     197
30     198
124    199
24     199
8      199
      ... 
153    278
14     283
93     288
6      294
132    295
140    295
83     298
25     302
51     302
141    303
59     303
120    303
36     304
134    306
40     308
81     308
61     309
112    313
98     315
82     318
44     321
110    325
45     325
16     340
161    342
4      354
39     360
96     394
28     417
85     564
Name: chol, Length: 165, dtype: int64

In [33]:
uci.groupby('target').get_group(1).chol.sort_values(ascending=False).head()

85    564
28    417
96    394
39    360
4     354
Name: chol, dtype: int64

In [34]:
uci.groupby('target').get_group(1).chol.sort_values().head()

111    126
53     141
151    149
162    157
94     160
Name: chol, dtype: int64

## 4. Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

In [35]:
uci.pivot(values='sex', columns='target').head()
#can turn values into columns
#play around with syntax to get something useful

target,0,1
0,,1.0
1,,1.0
2,,0.0
3,,1.0
4,,0.0
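The output above is mostly `NaN` because no `index` was given, so every original row keeps its own row. A pivot tends to be more useful when `index`, `columns`, and `values` are all named. Here is a minimal sketch on made-up long-format data (the `readings` frame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, measure) pair
readings = pd.DataFrame({
    'patient': ['p1', 'p1', 'p2', 'p2'],
    'measure': ['chol', 'trestbps', 'chol', 'trestbps'],
    'value': [233, 145, 250, 130],
})

# One row per patient, one column per measure
wide = readings.pivot(index='patient', columns='measure', values='value')
```

`.pivot()` raises an error if an index/column pair appears more than once; `.pivot_table()` handles that case by aggregating the duplicates.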


### Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [36]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'HP'])

In [38]:
toy1

Unnamed: 0,age,HP
0,63,142
1,33,47


In [39]:
toy2

Unnamed: 0,age,HP
0,63,100
1,33,200


In [37]:
toy1.join(toy2.set_index('age'),
          on='age',
          lsuffix='_A',
          rsuffix='_B').head()
#note left join by default

Unnamed: 0,age,HP_A,HP_B
0,63,142,100
1,33,47,200


In [40]:
toy1.set_index('age').join(toy2.set_index('age'),
          lsuffix='_A',
          rsuffix='_B').head()

Unnamed: 0_level_0,HP_A,HP_B
age,Unnamed: 1_level_1,Unnamed: 2_level_1
63,142,100
33,47,200


### `.merge()`

In [43]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [44]:
states = pd.read_csv('data/states.csv', index_col=0)
states

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


In [45]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington


In [46]:
ds_chars.join(states.set_index('state'),
             on='home_state')
#join calls merge under the hood; join defaults to how='left', merge defaults to how='inner'

Unnamed: 0,name,HP,home_state,nickname,capital
0,greg,200,WA,evergreen,Olympia
1,miles,200,WA,evergreen,Olympia
2,alan,170,TX,alamo,Austin
3,alison,300,DC,district,Washington
4,rachel,200,TX,alamo,Austin
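The `how` argument only matters when some keys are missing from one side, which is not the case for the tables above. A minimal sketch with made-up tables (`chars` and `state_info` are hypothetical, trimmed so that one key is unmatched on each side):

```python
import pandas as pd

# Hypothetical tables: 'DC' has no match on the right, 'OH' none on the left
chars = pd.DataFrame({'name': ['greg', 'alan', 'alison'],
                      'home_state': ['WA', 'TX', 'DC']})
state_info = pd.DataFrame({'state': ['WA', 'TX', 'OH'],
                           'capital': ['Olympia', 'Austin', 'Columbus']})

keys = dict(left_on='home_state', right_on='state')

inner = chars.merge(state_info, how='inner', **keys)  # matched rows only
left = chars.merge(state_info, how='left', **keys)    # every row of chars
# indicator=True adds a _merge column saying where each row came from
outer = chars.merge(state_info, how='outer', indicator=True, **keys)
```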


### `pd.concat()`

Exercise: Look up the documentation on pd.concat (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate ds_chars and states.
<br/>
Your result should still have only five rows!

In [47]:
pd.concat([ds_chars, states], sort=False)

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,,,
1,miles,200.0,WA,,,
2,alan,170.0,TX,,,
3,alison,300.0,DC,,,
4,rachel,200.0,TX,,,
0,,,,WA,evergreen,Olympia
1,,,,TX,alamo,Austin
2,,,,DC,district,Washington
3,,,,OH,buckeye,Columbus
4,,,,OR,beaver,Salem


In [49]:
states.columns = ['home_state', 'nickname', 'capital']
pd.concat([ds_chars, states], sort=False)

Unnamed: 0,name,HP,home_state,nickname,capital
0,greg,200.0,WA,,
1,miles,200.0,WA,,
2,alan,170.0,TX,,
3,alison,300.0,DC,,
4,rachel,200.0,TX,,
0,,,WA,evergreen,Olympia
1,,,TX,alamo,Austin
2,,,DC,district,Washington
3,,,OH,buckeye,Columbus
4,,,OR,beaver,Salem
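Both results above have ten rows because `pd.concat` stacks rows by default. One way to keep five rows is to concatenate along the columns with `axis=1`. The sketch below rebuilds small stand-ins for the two tables (the frames here are reconstructed for illustration, not read from the CSV files):

```python
import pandas as pd

# Stand-ins for ds_chars and states, rebuilt by hand for illustration
chars = pd.DataFrame({'name': ['greg', 'miles', 'alan', 'alison', 'rachel'],
                      'HP': [200, 200, 170, 300, 200]})
states = pd.DataFrame({'state': ['WA', 'TX', 'DC', 'OH', 'OR'],
                       'capital': ['Olympia', 'Austin', 'Washington',
                                   'Columbus', 'Salem']})

stacked = pd.concat([chars, states], sort=False)   # rows on top of rows
side_by_side = pd.concat([chars, states], axis=1)  # columns next to columns
```

Note that `axis=1` lines rows up by index label only, so the pairing of a person with a state here carries no meaning. To match on a shared key, use `.merge()` or `.join()`.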


### `pd.melt()`

Melting removes the structure from your DataFrame and puts the data in a 'variable' and 'value' format.

In [50]:
ds_chars.head()

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [51]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])
#went from wide to long format

Unnamed: 0,name,variable,value
0,greg,HP,200
1,miles,HP,200
2,alan,HP,170
3,alison,HP,300
4,rachel,HP,200
5,greg,home_state,WA
6,miles,home_state,WA
7,alan,home_state,TX
8,alison,home_state,DC
9,rachel,home_state,TX
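Melting is reversible. As a quick sketch on a hand-built stand-in for the table (the two-row `chars` frame is made up for illustration), `.pivot()` can turn the long format back into the wide one:

```python
import pandas as pd

# Hand-built stand-in, used only for illustration
chars = pd.DataFrame({'name': ['greg', 'alan'],
                      'HP': [200, 170],
                      'home_state': ['WA', 'TX']})

# Wide to long: one row per (name, variable) pair
long_form = pd.melt(chars, id_vars=['name'], value_vars=['HP', 'home_state'])

# Long back to wide: one row per name, one column per variable
wide_again = long_form.pivot(index='name', columns='variable', values='value')
```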


In [53]:
ds_chars.unstack()

name        0      greg
            1     miles
            2      alan
            3    alison
            4    rachel
HP          0       200
            1       200
            2       170
            3       300
            4       200
home_state  0        WA
            1        WA
            2        TX
            3        DC
            4        TX
dtype: object

# Data Cleaning
## Scenario

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

We need to download two data files. We can do this at the command line:

In [59]:
!brew install wget

In [60]:
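#each ! command runs in its own subshell, so `!cd data` would not change the working directory
#the files below are saved to the current directory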
!wget https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
!wget https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip

--2019-09-04 13:37:32--  https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
Resolving aqua.kingcounty.gov (aqua.kingcounty.gov)... 146.129.240.28
Connecting to aqua.kingcounty.gov (aqua.kingcounty.gov)|146.129.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 125329626 (120M) [application/x-zip-compressed]
Saving to: ‘Real Property Sales.zip’


2019-09-04 13:43:20 (353 KB/s) - ‘Real Property Sales.zip’ saved [125329626/125329626]

--2019-09-04 13:43:20--  https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip
Resolving aqua.kingcounty.gov (aqua.kingcounty.gov)... 146.129.240.28
Connecting to aqua.kingcounty.gov (aqua.kingcounty.gov)|146.129.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24608118 (23M) [application/x-zip-compressed]
Saving to: ‘Residential Building.zip’


2019-09-04 13:44:15 (445 KB/s) - ‘Residential Building.zip’ saved [24608118/24608118]



*Note:* If you do not have the `wget` command yet, you can install it with `brew install wget`, or use `curl -o <filename> <url>`.

Note that `%20` in a URL translates into a space. Even though you should *never put spaces in filenames*, you may need to deal with spaces that _other_ people have used in filenames.

There are two ways to handle the spaces in these filenames when referencing them at the command line.

#### 1. You can _escape_ the spaces by putting a backslash (`\`, remember _backslash is next to backspace_) before each one:

`unzip Real\ Property\ Sales.zip`

This is what happens if you tab-complete the filename in the terminal. Tab completion is your friend!

#### 2. You can put the entire filename in quotes:

`unzip "Real Property Sales.zip"`

Try unzipping these files with the `unzip` command. The `unzip` command takes one argument, the name of the file that you want to unzip.
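If you would rather stay in Python, the standard library's `zipfile` module can extract an archive too, and spaces in the filename need no escaping because the name is passed as an ordinary string. The sketch below is self-contained: it builds a small demo archive in a temporary folder first (the file names and contents are made up), then extracts it.

```python
import tempfile
import zipfile
from pathlib import Path

# Build a throwaway archive whose name contains spaces (demo data only)
workdir = Path(tempfile.mkdtemp())
archive = workdir / 'Demo Property Sales.zip'
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('Demo Property Sales.csv', 'Major,Minor\n1,2\n')

# Extract everything next to the archive; no escaping or quoting needed
with zipfile.ZipFile(archive) as zf:
    zf.extractall(workdir)

extracted = sorted(p.name for p in workdir.glob('*.csv'))
```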

In [67]:
!mv "Real Property Sales.zip" data/
!mv "Residential Building.zip" data/


In [None]:
#optional: pandas can read a zip archive that contains a single CSV directly
!unzip "data/Real Property Sales.zip" -d data

In [None]:
sales_df = pd.read_csv('data/Real Property Sales.zip')

In [None]:
sales_df.head()

In [None]:
sales_df.describe()

### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.
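One way to avoid mixed types on read is to tell `read_csv` up front which dtype to use for the affected columns. A minimal sketch on a made-up three-row CSV (the values are invented; only the column names come from the dataset):

```python
import io
import pandas as pd

# Made-up CSV text with a blank-padded parcel number in the second row
csv_text = (
    "Major,Minor,SalePrice\n"
    "100,0010,500000\n"
    "      ,0020,0\n"
    "300,0030,750000\n"
)

# Read the parcel columns as strings so every entry has the same type
raw = pd.read_csv(io.StringIO(csv_text), dtype={'Major': str, 'Minor': str})
```

Reading as strings also preserves leading zeros such as `0010`, at the cost of having to convert to numbers explicitly later.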

In [None]:
sales_df.head().T

### Data overload?

That's a lot of columns. We're only interested in identifying the date, sale price, and square footage of each specific property. What can we do?

In [None]:
small_sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']].copy()

In [None]:
small_sales_df.info()

In [None]:
bldg_df = pd.read_csv('data/Residential Building.zip')

### Another warning! Which column has index 11?

In [None]:
bldg_df.columns[11]

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [None]:
bldg_df.head().T

### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [None]:
small_bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']].copy()

In [None]:
small_bldg_df.info()

In [None]:
sales_data = pd.merge(small_sales_df, small_bldg_df, on=['Major', 'Minor'])

In [None]:
pd.merge?

In [None]:
sales_data.head()

### Error!

Why are we seeing an error when we try to join the dataframes? Compare the `.info()` output for `small_sales_df` (left) and `small_bldg_df` (right):

<table>
    <tr>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013160 entries, 0 to 2013159
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.4+ MB</pre></td>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511359 entries, 0 to 511358
Data columns (total 4 columns):
Major            511359 non-null int64
Minor            511359 non-null int64
SqFtTotLiving    511359 non-null int64
ZipCode          468345 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB
</pre></td>
    </tr>
</table>

Review the error message in light of the above:

* `ValueError: You are trying to merge on object and int64 columns.`
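The same failure can be reproduced on a tiny scale. In this sketch (both frames are made up), the key column holds strings on one side and integers on the other, so the merge is refused until the types agree:

```python
import pandas as pd

# Made-up frames: 'Major' is text on the left and integer on the right
left = pd.DataFrame({'Major': ['100', '200'], 'SalePrice': [500000, 750000]})
right = pd.DataFrame({'Major': [100, 200], 'SqFtTotLiving': [1200, 1800]})

try:
    pd.merge(left, right, on='Major')
    merge_error = None
except ValueError as err:
    merge_error = str(err)

# Convert the text keys to numbers, then merge again
left['Major'] = pd.to_numeric(left['Major'])
merged = pd.merge(left, right, on='Major')
```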

In [None]:
pd.to_numeric(sales_df['Major'])

### Error!

Note the useful error message above:

`ValueError: Unable to parse string "      " at position 936643`

In this case, we want to treat non-numeric values as missing values. Let's see if there's a way to change how the `pd.to_numeric` function handles errors.

In [None]:
# The single question mark means "show me the docstring"
pd.to_numeric?

Here's the part that we're looking for:
```
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input
```

Let's try setting the `errors` parameter to `'coerce'`.

In [None]:
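#assign the column directly; on recent pandas, .loc[:, 'Major'] = ... writes in place and can keep the object dtype
small_sales_df['Major'] = pd.to_numeric(sales_df['Major'], errors='coerce')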
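A toy version of the same conversion, on a made-up Series, shows what `errors='coerce'` does to a value that cannot be parsed:

```python
import pandas as pd

# Made-up values; the middle entry is only whitespace
raw = pd.Series(['100', '      ', '300'])

try:
    pd.to_numeric(raw)              # default errors='raise'
    raised = False
except ValueError:
    raised = True

# Unparseable entries become NaN, so the result is a float column
coerced = pd.to_numeric(raw, errors='coerce')
```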

Did it work?

In [None]:
small_sales_df.info()

It worked! Let's do the same thing with the `Minor` parcel number.

In [None]:
small_sales_df['Minor'] = pd.to_numeric(sales_df['Minor'], errors='coerce')

In [None]:
small_sales_df.info()

Now, let's try our join again.

In [None]:
sales_data = pd.merge(small_sales_df, small_bldg_df, on=['Major', 'Minor'])

In [None]:
sales_data.head()

In [None]:
sales_data.info()

We can see right away that we're missing zip codes for many of the sales transactions. (1321536 non-null entries for ZipCode is fewer than the 1436772 entries in the dataframe.) 

In [None]:
sales_data.loc[sales_data['ZipCode'].isna()].head(20)

Because we are interested in finding houses in Seattle zip codes, we will need to drop the rows with missing zip codes.

In [None]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]

In [None]:
sales_data.info()

# Your turn: Data Cleaning with Pandas

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

In [None]:
sales_data = sales_data.loc[(sales_data['SalePrice'] > 0) & (sales_data['SqFtTotLiving'] > 0)]
sales_data.head()
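When conditions are combined with `&`, each comparison needs its own parentheses, because `&` binds more tightly than `>` in Python. A minimal sketch on made-up rows (the `sales` frame is hypothetical):

```python
import pandas as pd

# Made-up rows: only the third one has a positive price and a positive area
sales = pd.DataFrame({
    'SalePrice': [0, 500000, 750000, -1],
    'SqFtTotLiving': [1200, 0, 1800, 900],
})

# Parenthesize each comparison before combining them with &
valid = sales.loc[(sales['SalePrice'] > 0) & (sales['SqFtTotLiving'] > 0)]
```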

### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [None]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
def is_integer(x):
    try:
        int(x)
    except (TypeError, ValueError):  # int(None) raises TypeError, int('abc') raises ValueError
        return False
    return True

pd.to_numeric(sales_data['ZipCode'])
# sales_data.ZipCode.value_counts()
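One possible approach to the ZIP+4 question, sketched on made-up values (the `zips` Series is hypothetical and the real column may contain other oddities): keep the first five characters, then keep only the entries that consist of exactly five digits.

```python
import pandas as pd

# Made-up values covering a few of the cases that could show up
zips = pd.Series(['98101', '98115-2345', '98052 ', None, 'WA'])

# Trim whitespace, then keep the first five characters
first_five = zips.str.strip().str[:5]

# True only for entries made of exactly five digits; missing counts as False
is_zip = first_five.str.fullmatch(r'\d{5}', na=False)

# Entries that fail the check become missing values
clean = first_five.where(is_zip)
```

Keeping the result as strings rather than integers is a deliberate choice in this sketch, since ZIP codes are labels and some begin with a zero.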

### 3. Add a column for PricePerSqFt



### 4. Subset the data to 2019 sales only.

We can assume that the DocumentDate is approximately the sale date.
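As a starting point, here is a rough sketch of filtering by year on made-up rows. The `MM/DD/YYYY` date format is an assumption; check how `DocumentDate` is actually written in the data before relying on it.

```python
import pandas as pd

# Made-up rows; the date format shown here is an assumption
docs = pd.DataFrame({
    'DocumentDate': ['08/21/2014', '03/15/2019', '12/31/2019'],
    'SalePrice': [400000, 650000, 700000],
})

# Parse the text into real dates, then filter on the year
docs['DocumentDate'] = pd.to_datetime(docs['DocumentDate'], format='%m/%d/%Y')
sales_2019 = docs.loc[docs['DocumentDate'].dt.year == 2019]
```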

### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

### 6. What is the mean price per square foot for a house sold in Seattle in 2019?

Don't just type the answer. Type code that generates the answer as output!