# Data Cleaning with Pandas

In [2]:
import pandas as pd

### Learning Goals:

- practice cleaning a real dataset
- practice using `py` files and the terminal in conjunction with jupyter notebooks
- run a `py` script through the terminal

## Scenario

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

When working on a project involving data that can fit on our computer, we store it in a `data` directory.

```bash
cd <project_directory>  # example: cd ~/flatiron_ds/pandas-3
mkdir data
cd data
```

Note that `<project_directory>` in angle brackets is a _placeholder_. You should type the path to the actual location on your computer where you're working on this project. Do not literally type `<project_directory>` and _do not type the angle brackets_. You can see an example in the _comment_ to the right of the command above.

![terminal](https://media3.giphy.com/media/yR4xZagT71AAM/giphy.gif?cid=790b76115d3620444553533759086a54&rid=giphy.gif)

Now, we'll need to download the two data files that we need. We can do this at the command line:

```bash
wget https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
wget https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip
```

*Note:* If you do not have the `wget` command yet, you can install it with `brew install wget`, or use `curl <url> -o <filename>`.

Note that `%20` in a URL translates into a space. Even though you will *never put spaces in filenames*, you may need to deal with spaces that _other_ people have used in filenames.
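
If you would rather decode such a URL in Python than by eye, the standard library's `urllib.parse.unquote` performs the translation. A small sketch using the first download URL above:

```python
from urllib.parse import unquote, urlsplit

url = "https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip"

# Take the last path segment and decode percent-escapes such as %20
filename = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
print(filename)
```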

There are two ways to handle the spaces in these filenames when referencing them at the command line.
![internetgif](https://media2.giphy.com/media/QWkuGmMgphvmE/giphy.gif?cid=790b76115d361f42304a6850369f37ea&rid=giphy.gif)

#### 1. You can _escape_ the spaces by putting a backslash (`\`, remember _backslash is next to backspace_) before each one:

`unzip Real\ Property\ Sales.zip`

This is what happens if you tab-complete the filename in the terminal. Tab completion is your friend!

#### 2. You can put the entire filename in quotes:

`unzip "Real Property Sales.zip"`

Try unzipping these files with the `unzip` command. The `unzip` command takes one argument, the name of the file that you want to unzip.

In [3]:
sales_df = pd.read_csv('EXTR_RPSale.csv')

DtypeWarning: Columns (1, 2) have mixed types.


### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.
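
One way to avoid this warning is to tell `read_csv` which type to use for the affected columns. The sketch below uses a tiny made-up CSV held in memory (not the real file) in which `Major` and `Minor` mix digits with blank strings, and reads both columns as text via the `dtype` parameter. Whether text is the right choice for the real data is an assumption; the point is that the type is chosen explicitly instead of guessed.

```python
import io
import pandas as pd

# Made-up stand-in for EXTR_RPSale.csv: one row has blank parcel numbers
csv_text = (
    "ExciseTaxNbr,Major,Minor,SalePrice\n"
    "2687551,138860,0110,245000\n"
    "1235111,      ,    ,0\n"
)

# Read the parcel-number columns as text so pandas does not have to infer a type
toy_sales = pd.read_csv(io.StringIO(csv_text), dtype={"Major": str, "Minor": str})
print(toy_sales.dtypes)
```

Reading as text also preserves any leading zeros, which an integer column would drop.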

In [4]:
sales_df.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,2687551,138860,110,08/21/2014,245000,20140828001436,,,,,...,3,6,3,N,N,N,N,1,8,
1,1235111,664885,40,07/09/1991,0,199203161090,71.0,1.0,664885.0,C,...,3,0,26,N,N,N,N,18,3,11
2,2704079,423943,50,10/11/2014,0,20141205000558,,,,,...,3,6,15,N,N,N,N,18,8,18 31 51
3,2584094,403700,715,01/04/2013,0,20130110000910,,,,,...,3,6,15,N,N,N,N,11,8,18 31 38
4,1056831,951120,900,04/20/1989,85000,198904260448,117.0,53.0,951120.0,P,...,3,0,2,N,N,N,N,1,9,49


In [5]:
sales_df.Major.describe()

count     2037951
unique      28599
top             0
freq         9363
Name: Major, dtype: int64

In [6]:
sales_df.Minor.describe()

count     2037951
unique      11306
top            20
freq        36583
Name: Minor, dtype: int64

In [7]:
sales_df.shape

(2037951, 24)

### Data overload?

That's a lot of columns. We're only interested in identifying the date, sale price, and square footage of each specific property. What can we do?

In [8]:
sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']]

In [9]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2037951 entries, 0 to 2037950
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 62.2+ MB


In [10]:
sales_df.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice
0,138860,110,08/21/2014,245000
1,664885,40,07/09/1991,0
2,423943,50,10/11/2014,0
3,403700,715,01/04/2013,0
4,951120,900,04/20/1989,85000
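
An alternative, sketched here on a small made-up CSV, is to ask `read_csv` for only the columns of interest with its `usecols` parameter, so the unwanted columns are never loaded into memory.

```python
import io
import pandas as pd

# Made-up rows with a few of the real column names
csv_text = (
    "ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr\n"
    "2687551,138860,110,08/21/2014,245000,20140828001436\n"
    "1235111,664885,40,07/09/1991,0,199203161090\n"
)

wanted = ["Major", "Minor", "DocumentDate", "SalePrice"]

# Only the listed columns are parsed; the rest are skipped
toy_sales = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(list(toy_sales.columns))
```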


In [11]:
bldg_df = pd.read_csv('EXTR_ResBldg.csv')

DtypeWarning: Columns (11) have mixed types.


### Another warning! Which column has index 11?

In [12]:
bldg_df.columns[11]

'ZipCode'

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [13]:
bldg_df.head()

Unnamed: 0,Major,Minor,BldgNbr,NbrLivingUnits,Address,BuildingNumber,Fraction,DirectionPrefix,StreetName,StreetType,...,FpMultiStory,FpFreestanding,FpAdditional,YrBuilt,YrRenovated,PcntComplete,Obsolescence,PcntNetCondition,Condition,AddnlCost
0,4000,398,1,1,14415 45TH LN S,14415,,,45TH,LN,...,0,0,0,2018,0,0,0,0,3,0
1,4000,460,1,1,4226 S 146TH ST 98168,4226,,S,146TH,ST,...,0,0,0,1993,0,0,0,0,3,5000
2,4000,635,1,1,4831 S 146TH ST 98168,4831,,S,146TH,ST,...,0,0,0,1943,0,0,0,0,3,0
3,4000,736,1,1,4429 S 146TH ST 98168,4429,,S,146TH,ST,...,0,0,0,1960,2012,0,0,0,3,0
4,4100,230,1,1,14835 42ND AVE S 98168,14835,,,42ND,AVE,...,0,0,0,1962,0,0,0,0,3,0


In [14]:
bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513242 entries, 0 to 513241
Data columns (total 50 columns):
Major                 513242 non-null int64
Minor                 513242 non-null int64
BldgNbr               513242 non-null int64
NbrLivingUnits        513242 non-null int64
Address               513242 non-null object
BuildingNumber        513242 non-null object
Fraction              513242 non-null object
DirectionPrefix       512680 non-null object
StreetName            513242 non-null object
StreetType            513242 non-null object
DirectionSuffix       512680 non-null object
ZipCode               468472 non-null object
Stories               513242 non-null float64
BldgGrade             513242 non-null int64
BldgGradeVar          513242 non-null int64
SqFt1stFloor          513242 non-null int64
SqFtHalfFloor         513242 non-null int64
SqFt2ndFloor          513242 non-null int64
SqFtUpperFloor        513242 non-null int64
SqFtUnfinFull         513242 non-null int64

### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [15]:
bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']]

In [16]:
bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513242 entries, 0 to 513241
Data columns (total 4 columns):
Major            513242 non-null int64
Minor            513242 non-null int64
SqFtTotLiving    513242 non-null int64
ZipCode          468472 non-null object
dtypes: int64(3), object(1)
memory usage: 15.7+ MB


In [17]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

In [18]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860,110,08/21/2014,245000,1490,98002
1,138860,110,06/12/1989,109300,1490,98002
2,138860,110,01/16/2005,14684,1490,98002
3,138860,110,06/08/2005,0,1490,98002
4,423943,50,10/11/2014,0,960,98092


### Error!

Why are we seeing an error when we try to join the dataframes?

<table>
    <tr>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013160 entries, 0 to 2013159
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.4+ MB</pre></td>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511359 entries, 0 to 511358
Data columns (total 4 columns):
Major            511359 non-null int64
Minor            511359 non-null int64
SqFtTotLiving    511359 non-null int64
ZipCode          468345 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB
</pre></td>
    </tr>
</table>

Review the error message in light of the above:

* `ValueError: You are trying to merge on object and int64 columns.`
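
A quick way to spot this kind of mismatch before merging is to compare the dtypes of the key columns side by side. The frames below are small made-up stand-ins for `sales_df` and `bldg_df`, not the real data.

```python
import pandas as pd

# Made-up stand-ins: text keys on one side, integer keys on the other
toy_sales = pd.DataFrame({"Major": ["138860", "      "], "Minor": ["110", "40"]})
toy_bldg = pd.DataFrame({"Major": [138860, 4000], "Minor": [110, 398]})

keys = ["Major", "Minor"]

# One row per key column, showing the dtype on each side of the merge
dtype_check = pd.DataFrame({
    "sales": toy_sales[keys].dtypes.astype(str),
    "bldg": toy_bldg[keys].dtypes.astype(str),
})
dtype_check["match"] = dtype_check["sales"] == dtype_check["bldg"]
print(dtype_check)
```

Any row where `match` is `False` points at a key column that needs converting before the merge.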

In [19]:
# sales_df['Major'] = pd.to_numeric(sales_df['Major'])

### Error!

Note the useful error message above:

`ValueError: Unable to parse string "      " at position 936643`

In this case, we want to treat non-numeric values as missing values. Let's see if there's a way to change how the `pd.to_numeric` function handles errors.

In [20]:
# The single question mark means "show me the docstring"
pd.to_numeric?

Signature: pd.to_numeric(arg, errors='raise', downcast=None)
Docstring:
Convert argument to a numeric type.

The default return dtype is `float64` or `int64`
depending on the data supplied. Use the `downcast` parameter
to obtain other dtypes.

Please note that precision loss may occur if really large numbers
are passed in. Due to the internal limitations of `ndarray`, if
numbers smaller than `-9223372036854775808` (np.iinfo(np.int64).min)
or larger than `18446744073709551615` (np.iinfo(np.uint64).max) are
passed in, it is very likely they will be converted to float so that
they can be stored in an `ndarray`. These warnings apply similarly to
`Series` since it internally leverages `ndarray`.

Parameters
----------
arg : scalar, list, tuple, 1-d array, or Series
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    

Here's the part that we're looking for:
```
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input
```

Let's try setting the `errors` parameter to `'coerce'`.

In [21]:
sales_df['Major'] = pd.to_numeric(sales_df['Major'], errors='coerce')

Did it work?

In [22]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2037951 entries, 0 to 2037950
Data columns (total 4 columns):
Major           float64
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: float64(1), int64(1), object(2)
memory usage: 62.2+ MB


It worked! Let's do the same thing with the `Minor` parcel number.

In [23]:
sales_df['Minor'] = pd.to_numeric(sales_df['Minor'], errors='coerce')

In [24]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2037951 entries, 0 to 2037950
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 62.2+ MB
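
The switch from `object` to `float64` follows from the coercion: unparseable values become `NaN`, and `NaN` is a floating-point value, so the whole column is stored as floats. A minimal sketch on made-up values, including a blank string like the one in the error message:

```python
import pandas as pd

# Object-dtype strings, as in the notebook; the middle value is only spaces
raw = pd.Series(["138860", "      ", "40"], dtype=object)

# Unparseable entries become NaN instead of raising ValueError
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
print(numeric.isna().sum())  # number of values that could not be parsed
```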


Now, let's try our join again.

In [25]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

In [26]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


In [27]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1455450 entries, 0 to 1455449
Data columns (total 6 columns):
Major            1455450 non-null float64
Minor            1455450 non-null float64
DocumentDate     1455450 non-null object
SalePrice        1455450 non-null int64
SqFtTotLiving    1455450 non-null int64
ZipCode          1335603 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 77.7+ MB


We can see right away that we're missing zip codes for many of the sales transactions. (The 1335603 non-null entries for ZipCode are fewer than the 1455450 entries in the dataframe.)

In [28]:
sales_data.loc[sales_data['ZipCode'].isna()].head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
23,717370.0,350.0,12/01/1997,0,3380,
24,717370.0,350.0,09/13/2004,300000,3380,
25,717370.0,350.0,02/06/2006,901000,3380,
42,277110.0,1923.0,02/08/2007,372500,1000,
43,277110.0,1923.0,02/08/2007,0,1000,


Because we are interested in finding houses in Seattle zip codes, we will need to drop the rows with missing zip codes.

In [29]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]

sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092
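
The same filter can also be written with `dropna`, and `isna().sum()` gives the number of missing values directly. A sketch on a made-up frame that reuses two of the column names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SalePrice": [245000, 300000, 901000],
    "ZipCode": ["98002", np.nan, "98092"],
})

# Count the rows with no zip code
n_missing = toy["ZipCode"].isna().sum()

# Equivalent to toy.loc[~toy["ZipCode"].isna(), :]
with_zip = toy.dropna(subset=["ZipCode"])
print(n_missing, len(with_zip))
```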


In [30]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1335603 entries, 0 to 1455449
Data columns (total 6 columns):
Major            1335603 non-null float64
Minor            1335603 non-null float64
DocumentDate     1335603 non-null object
SalePrice        1335603 non-null int64
SqFtTotLiving    1335603 non-null int64
ZipCode          1335603 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 71.3+ MB


In [31]:
sales_data.describe()

Unnamed: 0,Major,Minor,SalePrice,SqFtTotLiving
count,1335603.0,1335603.0,1335603.0,1335603.0
mean,447836.7,1587.396,285565.5,2090.818
std,286181.2,2871.77,747004.6,967.4067
min,40.0,1.0,-600.0,0.0
25%,202107.0,116.0,0.0,1430.0
50%,385690.0,330.0,160000.0,1920.0
75%,722750.0,1040.0,349950.0,2550.0
max,990600.0,9689.0,37500000.0,48160.0


# Your turn: Data Cleaning with Pandas
![turtletype](https://media3.giphy.com/media/cFdHXXm5GhJsc/giphy.gif?cid=790b76115d3627d8354c7179366b0672&rid=giphy.gif)

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

In [32]:
sales_data['SalePrice'].value_counts()

0         458804
250000      4770
300000      4629
350000      4442
200000      4111
           ...  
369410         1
575270         1
399100         1
407288         1
24564          1
Name: SalePrice, Length: 72074, dtype: int64

In [33]:
sales_data['SalePrice'].describe()

count    1.335603e+06
mean     2.855655e+05
std      7.470046e+05
min     -6.000000e+02
25%      0.000000e+00
50%      1.600000e+05
75%      3.499500e+05
max      3.750000e+07
Name: SalePrice, dtype: float64

In [72]:
clean_sales_data = sales_data.loc[(sales_data['SalePrice'] > 0) & (sales_data['SqFtTotLiving'] > 0)]
clean_sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
6,423943.0,50.0,07/15/1999,96000,960,98092
7,423943.0,50.0,01/08/2001,127500,960,98092


In [73]:
clean_sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 876788 entries, 0 to 1455449
Data columns (total 6 columns):
Major            876788 non-null float64
Minor            876788 non-null float64
DocumentDate     876788 non-null object
SalePrice        876788 non-null int64
SqFtTotLiving    876788 non-null int64
ZipCode          876788 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 46.8+ MB


In [74]:
clean_sales_data['SalePrice'].describe()

count    8.767880e+05
mean     4.349939e+05
std      8.860111e+05
min      1.000000e+00
25%      1.650000e+05
50%      2.790000e+05
75%      4.590000e+05
max      3.750000e+07
Name: SalePrice, dtype: float64

In [75]:
clean_sales_data['SqFtTotLiving'].describe()

count    876788.000000
mean       2105.909525
std         949.954601
min           1.000000
25%        1450.000000
50%        1950.000000
75%        2570.000000
max       48160.000000
Name: SqFtTotLiving, dtype: float64

### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [76]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
def is_integer(x):
    try:
        _ = int(x)
    except ValueError:
        return False
    return True

In [77]:
clean_sales_data.loc[~clean_sales_data['ZipCode'].apply(is_integer), 'ZipCode']

32602      98033-4917
32603      98033-4917
32604      98033-4917
32605      98033-4917
32606      98033-4917
              ...    
1422637    98028-8533
1425542    98028-6100
1428383            WA
1428664    98042-3001
1428665    98042-3001
Name: ZipCode, Length: 178, dtype: object

In [78]:
def fix_zip(z):
    if not is_integer(z):
        return z[:5]
    return z

In [79]:
clean_sales_data['ZipCode'] = clean_sales_data.ZipCode.map(lambda x: str(x)[:5])
clean_sales_data['ZipCode'].describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


count     876788
unique       129
top        98042
freq       28430
Name: ZipCode, dtype: object
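
The `SettingWithCopyWarning` above appears because `clean_sales_data` was created by filtering `sales_data`, and pandas cannot tell whether the assignment is meant to affect the original frame. One common remedy, sketched here on made-up data, is to take an explicit `.copy()` when creating the filtered frame; later column assignments then clearly apply to the new frame only.

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "SalePrice": [245000, 0, 96000],
    "ZipCode": ["98002", "98092", "98033-4917"],
})

# Explicit copy: the filtered frame is independent of toy_sales
toy_clean = toy_sales.loc[toy_sales["SalePrice"] > 0].copy()
toy_clean["ZipCode"] = toy_clean["ZipCode"].str[:5]

print(toy_clean["ZipCode"].tolist())
print(toy_sales["ZipCode"].tolist())  # original is unchanged
```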

In [80]:
clean_sales_data.ZipCode

0          98002
1          98002
2          98002
6          98092
7          98092
           ...  
1455442    98155
1455443    98006
1455445    98115
1455448    98001
1455449    98042
Name: ZipCode, Length: 876788, dtype: object
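
Slicing to five characters shortens ZIP+4 codes, but it leaves a value such as `WA` in place. One possible alternative, shown on made-up values, is to keep only entries that begin with five digits using `Series.str.extract`; anything else becomes missing and can then be dropped. Treating such values as missing is an assumption about what the analysis needs.

```python
import pandas as pd

zips = pd.Series(["98033-4917", "98002", "WA"])

# Capture five leading digits; non-matching entries become missing values
five_digit = zips.str.extract(r"^(\d{5})", expand=False)
print(five_digit)
```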

### 3. Add a column for PricePerSqFt



In [81]:
clean_sales_data['PricePerSqFt'] = clean_sales_data.SalePrice / clean_sales_data.SqFtTotLiving
clean_sales_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
0,138860.0,110.0,08/21/2014,245000,1490,98002,164.429530
1,138860.0,110.0,06/12/1989,109300,1490,98002,73.355705
2,138860.0,110.0,01/16/2005,14684,1490,98002,9.855034
6,423943.0,50.0,07/15/1999,96000,960,98092,100.000000
7,423943.0,50.0,01/08/2001,127500,960,98092,132.812500
11,403700.0,715.0,07/03/2013,464500,1780,98008,260.955056
12,403700.0,715.0,02/21/2013,357000,1780,98008,200.561798
13,403700.0,715.0,10/13/1995,142000,1780,98008,79.775281
14,403700.0,715.0,02/22/2007,528000,1780,98008,296.629213
16,721481.0,90.0,01/03/1992,193000,3220,98072,59.937888


### 4. Subset the data to 2019 sales only.

We can assume that the DocumentDate is approximately the sale date.

In [82]:
clean_sales_data_2019 = clean_sales_data.loc[clean_sales_data['DocumentDate'].str[-4:] == '2019']

In [83]:
clean_sales_data_2019.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
368,797320.0,2320.0,03/27/2019,540000,1240,98146,435.483871
439,140281.0,20.0,06/04/2019,450000,1080,98019,416.666667
491,82007.0,9027.0,02/15/2019,895000,3160,98022,283.227848
816,302605.0,9246.0,05/07/2019,1010000,2260,98034,446.902655
919,98400.0,450.0,02/20/2019,409950,1850,98058,221.594595
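
Comparing the last four characters relies on every date being written as `MM/DD/YYYY`. An alternative sketch, assuming that format holds for every row, parses the column with `pd.to_datetime` and filters on the year. The rows below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "DocumentDate": ["03/27/2019", "08/21/2014", "06/04/2019"],
    "SalePrice": [540000, 245000, 450000],
})

# Parse month/day/year strings into datetimes; raises if a row does not match
dates = pd.to_datetime(toy["DocumentDate"], format="%m/%d/%Y")

# Keep only the rows whose parsed year is 2019
toy_2019 = toy.loc[dates.dt.year == 2019]
print(toy_2019)
```

Parsed dates also make it possible to filter by month or date range later, which string slicing does not handle well.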


### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

In [86]:
zipcodes = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112,
            98115, 98116, 98117, 98118, 98119, 98121, 98122, 98125, 98126, 98133,
            98134, 98136, 98144, 98146, 98154, 98164, 98174, 98177, 98178, 98195, 98199]

# ZipCode holds strings, so convert the integer zip codes to strings before matching
seattle_zips = pd.Series(zipcodes).astype(str)
clean_sales_seattle_2019 = clean_sales_data_2019.loc[clean_sales_data_2019.ZipCode.isin(seattle_zips)]
clean_sales_seattle_2019.shape

(4573, 7)
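
The conversion to strings matters because `ZipCode` holds text, and `isin` compares values exactly: the string `'98101'` does not match the integer `98101`. A small sketch on made-up values:

```python
import pandas as pd

# Object-dtype strings, as in the notebook's ZipCode column
zip_column = pd.Series(["98101", "98002", "98199"], dtype=object)
seattle_ints = [98101, 98199]

# Integers never match the string values
n_int_matches = zip_column.isin(seattle_ints).sum()

# Converting the list to strings first gives the intended matches
seattle_strs = [str(z) for z in seattle_ints]
n_str_matches = zip_column.isin(seattle_strs).sum()

print(n_int_matches, n_str_matches)
```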

### 6. What is the mean price per square foot for a house sold in Seattle in 2019?

Don't just type the answer. Type code that generates the answer as output!

In [88]:
clean_sales_seattle_2019.describe()

Unnamed: 0,Major,Minor,SalePrice,SqFtTotLiving,PricePerSqFt
count,4573.0,4573.0,4573.0,4573.0,4573.0
mean,457867.479991,1270.252788,844674.3,1779.790072,515.922394
std,288768.669041,2219.138958,595488.1,787.18417,567.30481
min,140.0,4.0,10.0,280.0,0.003155
25%,212370.0,121.0,560000.0,1230.0,364.465409
50%,395940.0,395.0,727500.0,1630.0,464.516129
75%,727610.0,1243.0,960000.0,2180.0,577.380952
max,990400.0,9651.0,17255000.0,7110.0,30271.929825
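
`describe()` reports the mean alongside other statistics. To output just the one number, the column's `mean()` method can be called directly; the sketch below uses made-up rows rather than the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "SalePrice": [540000, 450000, 895000],
    "SqFtTotLiving": [1200, 1000, 3580],
})
toy["PricePerSqFt"] = toy["SalePrice"] / toy["SqFtTotLiving"]

# Mean of the per-sale ratios, which is what describe() reports as "mean"
mean_ppsf = toy["PricePerSqFt"].mean()
print(round(mean_ppsf, 2))
```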


## Turning code into a script

#### make a new .py file
- open a new `.py` file _or_ open a new Jupyter notebook and export it as a `.py` file, so you can edit the `.py` file directly
- save the file as `mean_ppsf_seattle.py`
- look at all your code between `sales_df = pd.read_csv('data/Real Property Sales.zip')` and question 6 above

#### review & organize your code
- _organize_ your code in `mean_ppsf_seattle.py` to start with `sales_df = pd.read_csv('data/Real Property Sales.zip')` and end by printing the mean price per square foot for a house sold in Seattle in 2019
- the code should be able to run without throwing any errors
- remember to include `import pandas as pd` and any other necessary statements at the start of your script

#### test your script
- go to the terminal
- make sure you are in the same directory as your jupyter notebook and the `.py` file
- in the terminal, type and run `python mean_ppsf_seattle.py`
- confirm that the script prints the output you expected in the terminal

#### send your script to Ammar and Marisa

![anykey](https://media2.giphy.com/media/26BGIqWh2R1fi6JDa/giphy.gif?cid=790b76115d3627d8354c7179366b0672&rid=giphy.gif)