# Data Cleaning with Pandas

## Scenario

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

In [2]:
import pandas as pd

### Learning Goals:

- practice cleaning a real dataset
- practice using `py` files and the terminal in conjunction with Jupyter notebooks
- run a `py` script through the terminal

### First, get the data!

When working on a project involving data that can fit on our computer, we store it in a `data` directory.

```bash
cd <project_directory>  # example: cd ~/flatiron_ds/pandas-3
mkdir data
cd data
```

Note that `<project_directory>` in angle brackets is a _placeholder_. You should type the path to the actual location on your computer where you're working on this project. Do not literally type `<project_directory>` and _do not type the angle brackets_. You can see an example in the _comment_ to the right of the command above.

![terminal](https://media3.giphy.com/media/yR4xZagT71AAM/giphy.gif?cid=790b76115d3620444553533759086a54&rid=giphy.gif)

Now, we'll need to download the two data files that we need. We can do this at the command line:

```bash
wget https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
wget https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip
```

*Note:* If you do not have the `wget` command yet, you can install it with `brew install wget`, or use `curl <url> -o <filename>`.

Note that `%20` in a URL translates into a space. Even though you will *never put spaces in filenames*, you may need to deal with spaces that _other_ people have used in filenames.
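To see the `%20` translation for ourselves, here is a small sketch using Python's standard library. The `urllib.parse` functions are real; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# Decode the percent-encoded filename from the download URL.
encoded_name = "Real%20Property%20Sales.zip"
decoded_name = unquote(encoded_name)
print(decoded_name)  # Real Property Sales.zip

# Going the other way: encode a filename so it is safe to use in a URL.
encoded_again = quote("Residential Building.zip")
print(encoded_again)  # Residential%20Building.zip
```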

There are two ways to handle the spaces in these filenames when referencing them at the command line.


![internetgif](https://media2.giphy.com/media/QWkuGmMgphvmE/giphy.gif?cid=790b76115d361f42304a6850369f37ea&rid=giphy.gif)

In [None]:
# curl (entire file name) -o (what we'll call the file)

#### 1. You can _escape_ the spaces by putting a backslash (`\`, remember _backslash is next to backspace_) before each one:

`unzip Real\ Property\ Sales.zip`

This is what happens if you tab-complete the filename in the terminal. Tab completion is your friend!

#### 2. You can put the entire filename in quotes:

`unzip "Real Property Sales.zip"`

Try unzipping these files with the `unzip` command. The `unzip` command takes one argument, the name of the file that you want to unzip.

In [3]:
sales_df = pd.read_csv('data/EXTR_RPSale.csv')

DtypeWarning: Columns (1, 2) have mixed types.


In [3]:
bldg_df = pd.read_csv('data/EXTR_ResBldg.csv')

DtypeWarning: Columns (11) have mixed types.


### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.
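If we want to see what "mixed types" means in practice, one option is to count the Python types stored in a column. The sketch below uses a made-up three-row frame (not the real data) whose `Major` and `Minor` columns mix integers and strings.

```python
import pandas as pd

# Made-up rows: the second and third columns mix integers and strings.
toy = pd.DataFrame({
    "ExciseTaxNbr": [1, 2, 3],
    "Major": [138860, "423943", "      "],
    "Minor": [110, "50", "      "],
})

# Positions 1 and 2 are the second and third columns, because counting starts at zero.
flagged = list(toy.columns[[1, 2]])
print(flagged)  # ['Major', 'Minor']

# Count how many values of each Python type the column holds.
type_counts = toy["Major"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```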

In [6]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2035819 entries, 0 to 2035818
Data columns (total 24 columns):
ExciseTaxNbr          int64
Major                 object
Minor                 object
DocumentDate          object
SalePrice             int64
RecordingNbr          object
Volume                object
Page                  object
PlatNbr               object
PlatType              object
PlatLot               object
PlatBlock             object
SellerName            object
BuyerName             object
PropertyType          int64
PrincipalUse          int64
SaleInstrument        int64
AFForestLand          object
AFCurrentUseLand      object
AFNonProfitUse        object
AFHistoricProperty    object
SaleReason            int64
PropertyClass         int64
dtypes: int64(7), object(17)
memory usage: 372.8+ MB


### Data overload?

That's a lot of columns. We're only interested in identifying the date, sale price, and square footage of each specific property. What can we do?

In [14]:
sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']]
# columns appear in the result in the order they are listed here
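A tiny sketch of the same idea, using a made-up frame: selecting with a list of column names returns only those columns, in the order the list gives them.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Ask for "c" first and "a" second; "b" is dropped.
subset = toy[["c", "a"]]
print(list(subset.columns))  # ['c', 'a']
```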

In [26]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2035819 entries, 0 to 2035818
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 62.1+ MB


In [None]:
bldg_df = pd.read_csv('data/Residential Building.zip')

### Another warning! Which column has index 11?

In [16]:
bldg_df.columns[11]

'ZipCode'

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [28]:
bldg_df.columns

Index(['Major', 'Minor', 'BldgNbr', 'NbrLivingUnits', 'Address',
       'BuildingNumber', 'Fraction', 'DirectionPrefix', 'StreetName',
       'StreetType', 'DirectionSuffix', 'ZipCode', 'Stories', 'BldgGrade',
       'BldgGradeVar', 'SqFt1stFloor', 'SqFtHalfFloor', 'SqFt2ndFloor',
       'SqFtUpperFloor', 'SqFtUnfinFull', 'SqFtUnfinHalf', 'SqFtTotLiving',
       'SqFtTotBasement', 'SqFtFinBasement', 'FinBasementGrade',
       'SqFtGarageBasement', 'SqFtGarageAttached', 'DaylightBasement',
       'SqFtOpenPorch', 'SqFtEnclosedPorch', 'SqFtDeck', 'HeatSystem',
       'HeatSource', 'BrickStone', 'ViewUtilization', 'Bedrooms',
       'BathHalfCount', 'Bath3qtrCount', 'BathFullCount', 'FpSingleStory',
       'FpMultiStory', 'FpFreestanding', 'FpAdditional', 'YrBuilt',
       'YrRenovated', 'PcntComplete', 'Obsolescence', 'PcntNetCondition',
       'Condition', 'AddnlCost'],
      dtype='object')

### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [15]:
bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']]

In [30]:
bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 512827 entries, 0 to 512826
Data columns (total 4 columns):
Major            512827 non-null int64
Minor            512827 non-null int64
SqFtTotLiving    512827 non-null int64
ZipCode          468452 non-null object
dtypes: int64(3), object(1)
memory usage: 15.7+ MB


In [31]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

### Error!

Why are we seeing an error when we try to join the dataframes?

<table>
    <tr>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013160 entries, 0 to 2013159
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.4+ MB</pre></td>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511359 entries, 0 to 511358
Data columns (total 4 columns):
Major            511359 non-null int64
Minor            511359 non-null int64
SqFtTotLiving    511359 non-null int64
ZipCode          468345 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB
</pre></td>
    </tr>
</table>

Review the error message in light of the above:

* `ValueError: You are trying to merge on object and int64 columns.`
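The same failure can be reproduced on a small scale. In this sketch the two frames are made up: the key column holds strings on one side and integers on the other, which is the mismatch the `info()` output above reveals. The exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Key stored as strings on the left and as integers on the right.
left = pd.DataFrame({"Major": ["138860", "423943"], "SalePrice": [245000, 96000]})
right = pd.DataFrame({"Major": [138860, 423943], "SqFtTotLiving": [1490, 960]})

try:
    pd.merge(left, right, on="Major")
    merge_failed = False
except ValueError as err:
    merge_failed = True
    print(err)

# Comparing the key dtypes before merging makes the mismatch easy to spot.
print(left["Major"].dtype, right["Major"].dtype)
```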

In [32]:
sales_df['Major'] = pd.to_numeric(sales_df['Major'])
# trying to change string to numeric

ValueError: Unable to parse string "      " at position 940671

### Error!

Note the useful error message above:

`ValueError: Unable to parse string "      " at position 940671`

In this case, we want to treat non-numeric values as missing values. Let's see if there's a way to change how the `pd.to_numeric` function handles errors.

In [33]:
# The single question mark means "show me the docstring"
pd.to_numeric?

Here's the part that we're looking for:
```
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input
```

Let's try setting the `errors` parameter to `'coerce'`.
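Before running this on two million rows, it can help to try it on a toy Series. This sketch uses made-up values, stored with `dtype=object` to mimic the real column, and shows that a blank string becomes `NaN`.

```python
import pandas as pd

# Made-up parcel numbers; the middle value is blank, like the one in the error message.
raw = pd.Series(["138860", "      ", "50"], dtype=object)

coerced = pd.to_numeric(raw, errors="coerce")
n_missing = coerced.isna().sum()
print(coerced.dtype)  # float64
print(n_missing)      # 1
```

The result is a float column rather than an integer one because `NaN` is a floating-point value.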

In [16]:
sales_df['Major'] = pd.to_numeric(sales_df['Major'], errors='coerce')
# switching to numeric, setting response for error

Did it work?

In [35]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2035819 entries, 0 to 2035818
Data columns (total 4 columns):
Major           float64
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: float64(1), int64(1), object(2)
memory usage: 62.1+ MB


It worked! Let's do the same thing with the `Minor` parcel number.

In [17]:
sales_df['Minor'] = pd.to_numeric(sales_df['Minor'], errors='coerce')

In [37]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2035819 entries, 0 to 2035818
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 62.1+ MB


Now, let's try our join again.

In [18]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

In [39]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


In [40]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1453814 entries, 0 to 1453813
Data columns (total 6 columns):
Major            1453814 non-null float64
Minor            1453814 non-null float64
DocumentDate     1453814 non-null object
SalePrice        1453814 non-null int64
SqFtTotLiving    1453814 non-null int64
ZipCode          1334823 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 77.6+ MB


We can see right away that we're missing zip codes for many of the sales transactions. (The 1334823 non-null entries for `ZipCode` are fewer than the 1453814 entries in the dataframe.)
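Rather than subtracting the numbers in the `info()` output by hand, we can ask pandas to count the missing values directly. A minimal sketch on a made-up column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ZipCode": ["98002", np.nan, "98092", np.nan]})

n_rows = len(toy)                        # total rows
n_present = toy["ZipCode"].count()       # non-null values
n_missing = toy["ZipCode"].isna().sum()  # missing values
print(n_rows, n_present, n_missing)      # 4 2 2
```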

In [19]:
sales_data.loc[sales_data['ZipCode'].isna()].head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
91,717370.0,350.0,12/01/1997,0,3380,
92,717370.0,350.0,09/13/2004,300000,3380,
93,717370.0,350.0,02/06/2006,901000,3380,
110,277110.0,1923.0,02/08/2007,372500,1000,
111,277110.0,1923.0,02/08/2007,0,1000,


Because we are interested in finding houses in Seattle zip codes, we will need to drop the rows with missing zip codes.

In [20]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]
# ~ inverts the boolean mask, so we keep only the rows where ZipCode is not missing
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092
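The `~` operator flips every `True` in a boolean mask to `False` and vice versa. As a sketch on made-up rows, the mask-based filter gives the same result as `dropna` restricted to one column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SalePrice": [245000, 300000, 96000],
    "ZipCode": ["98002", np.nan, "98092"],
})

mask_missing = toy["ZipCode"].isna()

# Keep the rows where the mask is False, two equivalent ways.
kept_with_mask = toy.loc[~mask_missing, :]
kept_with_dropna = toy.dropna(subset=["ZipCode"])

print(kept_with_mask.equals(kept_with_dropna))  # True
```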


# Your turn: Data Cleaning with Pandas
![turtletype](https://media3.giphy.com/media/cFdHXXm5GhJsc/giphy.gif?cid=790b76115d3627d8354c7179366b0672&rid=giphy.gif)

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

In [21]:
sales_data = sales_data.loc[sales_data['SalePrice'] > 0]

In [22]:
sales_data.loc[sales_data['SqFtTotLiving'] < 50]

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
21108,857600.0,350.0,12/13/2017,2025000,2,98010
209868,156310.0,594.0,04/16/1996,169950,2,98116
209870,156310.0,594.0,06/29/1998,251500,2,98116
209871,156310.0,594.0,08/10/2000,245000,2,98116
209872,156310.0,594.0,05/05/1992,156500,2,98116
209873,156310.0,594.0,07/17/1998,251500,2,98116
382863,941240.0,165.0,10/28/2011,1000000,0,98118
382865,941240.0,165.0,07/16/2014,1300000,0,98118
382867,941240.0,165.0,09/19/2000,54440,0,98118
382869,941240.0,165.0,11/09/2011,1515000,0,98118


In [23]:
sales_data = sales_data.loc[sales_data['SqFtTotLiving'] > 50]

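In [70]:
sales_data_copy = sales_data.copy()  # .copy() makes an independent backup; plain assignment only adds a second name

In [71]:
sales_data = sales_data_copy.copy()  # restore the working dataframe from the backup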

### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [24]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
def is_integer(x):
    try:
        _ = int(x)
    except ValueError:
        return False
    return True

sales_data.loc[sales_data['ZipCode'].apply(is_integer) == False, 'ZipCode'].head()

13522    98031-3173
13842    98042-3001
13843    98042-3001
13844    98042-3001
13845    98042-3001
Name: ZipCode, dtype: object

In [122]:
sales_data.ZipCode = sales_data.ZipCode.str[:5]
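Here is what the `.str[:5]` slice does on a made-up Series. ZIP+4 codes are shortened, five-digit codes are unchanged, and missing values stay missing.

```python
import numpy as np
import pandas as pd

zips = pd.Series(["98031-3173", "98042", np.nan], dtype=object)

# Keep only the first five characters of each string.
short = zips.str[:5]
print(short.tolist())
```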

In [75]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
6,423943.0,50.0,07/15/1999,96000,960,98092
7,423943.0,50.0,01/08/2001,127500,960,98092


### 3. Add a column for PricePerSqFt



In [77]:
sales_data['PricePerSqFt'] = sales_data['SalePrice']/sales_data['SqFtTotLiving']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [78]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
0,138860.0,110.0,08/21/2014,245000,1490,98002,164.42953
1,138860.0,110.0,06/12/1989,109300,1490,98002,73.355705
2,138860.0,110.0,01/16/2005,14684,1490,98002,9.855034
6,423943.0,50.0,07/15/1999,96000,960,98092,100.0
7,423943.0,50.0,01/08/2001,127500,960,98092,132.8125
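The warning printed when we added `PricePerSqFt` appears because `sales_data` was produced by filtering another dataframe, so pandas cannot tell whether we meant to modify the original. One common way to avoid the ambiguity, sketched here on made-up rows, is to call `.copy()` right after filtering.

```python
import pandas as pd

toy = pd.DataFrame({
    "SalePrice": [245000, 0, 96000],
    "SqFtTotLiving": [1490, 1490, 960],
})

# .copy() gives us an independent dataframe that is safe to add columns to.
valid = toy.loc[toy["SalePrice"] > 0].copy()
valid["PricePerSqFt"] = valid["SalePrice"] / valid["SqFtTotLiving"]

print("PricePerSqFt" in toy.columns)  # False: the original is untouched
```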


### 4. Subset the data to 2019 sales only.

We can assume that the DocumentDate is approximately the sale date.

In [None]:
sales_data['date'] = pd.to_datetime(sales_data.DocumentDate)
sales_data2019 = sales_data[sales_data.date.dt.year == 2019]

In [96]:
data2019 = sales_data.loc[sales_data.DocumentDate.str[-4:] == '2019']
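Both approaches above select the same rows. If we go the datetime route, passing an explicit `format` tells pandas exactly how the strings are laid out. This is a sketch on made-up dates, assuming every `DocumentDate` follows the month/day/year pattern seen in the output above.

```python
import pandas as pd

toy = pd.DataFrame({"DocumentDate": ["08/21/2014", "02/06/2019", "12/30/2019"]})

# Parse the strings as month/day/year, then filter on the year.
toy["date"] = pd.to_datetime(toy["DocumentDate"], format="%m/%d/%Y")
sales_2019 = toy.loc[toy["date"].dt.year == 2019]

print(len(sales_2019))  # 2
```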

### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

In [81]:
SeattleZipCodes = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98164, 98174, 98177, 98178, 98195, 98199]

In [105]:
seattle2019 = data2019.loc[data2019['ZipCode'].isin(SeattleZipCodes)]

In [94]:
seattle2019['ZipCode'] = pd.to_numeric(seattle2019['ZipCode'])

In [102]:
data2019['ZipCode'] = pd.to_numeric(data2019['ZipCode'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [103]:
data2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12642 entries, 240 to 1453638
Data columns (total 7 columns):
Major            12642 non-null float64
Minor            12642 non-null float64
DocumentDate     12642 non-null object
SalePrice        12642 non-null int64
SqFtTotLiving    12642 non-null int64
ZipCode          12154 non-null float64
PricePerSqFt     12642 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 790.1+ KB


In [95]:
seattle2019

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
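An empty result like this is what we get when the column holds zip codes as strings but the list holds integers: `isin` compares values exactly, so `'98101'` never matches `98101`. A sketch with made-up values:

```python
import pandas as pd

zip_strings = pd.Series(["98101", "98002", "98118"], dtype=object)
seattle_ints = [98101, 98118]

# Strings never equal integers, so nothing matches.
n_matches_ints = zip_strings.isin(seattle_ints).sum()
print(n_matches_ints)  # 0

# Converting the list to strings makes the types agree.
seattle_strs = [str(z) for z in seattle_ints]
n_matches_strs = zip_strings.isin(seattle_strs).sum()
print(n_matches_strs)  # 2
```

Converting the column to numbers instead, as the cells above do, works just as well. What matters is that both sides use the same type.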


### 6. What is the mean price per square foot for a house sold in Seattle in 2019?

Don't just type the answer. Type code that generates the answer as output!

In [109]:
seattle2019['PricePerSqFt'].mean()

520.5525738366053

## Turning code into a script

#### make a new .py file
- open a new `.py` file _or_ open a new Jupyter notebook and export it as a `.py` file so we can start to edit the `.py` file directly
- save the file as `mean_ppsf_seattle.py`
- look at all your code between `sales_df = pd.read_csv('data/Real Property Sales.zip')` and question `number 6` above

#### review & organize your code
- _organize_ your code in `mean_ppsf_seattle.py` to start with `sales_df = pd.read_csv('data/Real Property Sales.zip')` and end with printing out the mean price per square foot for a house sold in Seattle in 2019
- the code should be able to run without throwing any errors
- remember to include `import pandas as pd` and any other necessary statements at the start of your script

#### test your script
- go to the terminal
- make sure you are in the same directory as your Jupyter notebook and the `.py` file
- in the terminal, type and run `python mean_ppsf_seattle.py`
- confirm the script prints what you expected in the terminal
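One possible shape for such a script is sketched below. It is not the only way to organize it. The function name `mean_price_per_sqft` and the two `demo_` frames are made up for illustration; in the real script the two frames would come from `pd.read_csv`, and the zip code list would be the full Seattle list.

```python
import pandas as pd


def mean_price_per_sqft(sales_df, bldg_df, zip_codes, year):
    """Mean sale price per square foot for the given zip codes and sale year."""
    sales = sales_df[["Major", "Minor", "DocumentDate", "SalePrice"]].copy()
    bldg = bldg_df[["Major", "Minor", "SqFtTotLiving", "ZipCode"]].copy()

    # Parcel numbers may be strings; values that cannot be parsed become NaN.
    for col in ["Major", "Minor"]:
        sales[col] = pd.to_numeric(sales[col], errors="coerce")

    merged = pd.merge(sales, bldg, on=["Major", "Minor"])
    merged = merged.dropna(subset=["ZipCode"])
    valid = (merged["SalePrice"] > 0) & (merged["SqFtTotLiving"] > 50)
    merged = merged.loc[valid].copy()

    # Shorten ZIP+4 codes and keep zip codes as strings throughout.
    merged["ZipCode"] = merged["ZipCode"].astype(str).str[:5]
    merged["Year"] = pd.to_datetime(merged["DocumentDate"], format="%m/%d/%Y").dt.year
    wanted = [str(z) for z in zip_codes]
    subset = merged.loc[(merged["Year"] == year) & merged["ZipCode"].isin(wanted)]
    return (subset["SalePrice"] / subset["SqFtTotLiving"]).mean()


# Made-up rows standing in for the real King County files.
demo_sales = pd.DataFrame({
    "Major": ["138860", "423943", "      "],
    "Minor": ["110", "50", "1"],
    "DocumentDate": ["02/06/2019", "07/15/1999", "03/01/2019"],
    "SalePrice": [298000, 96000, 500000],
})
demo_bldg = pd.DataFrame({
    "Major": [138860, 423943],
    "Minor": [110, 50],
    "SqFtTotLiving": [1490, 960],
    "ZipCode": ["98118-1234", "98092"],
})

if __name__ == "__main__":
    print(mean_price_per_sqft(demo_sales, demo_bldg, [98118], 2019))
```

Putting the cleaning steps in a function keeps the script easy to rerun on new data, and the `if __name__ == "__main__":` guard means the answer is printed only when the file is run from the terminal.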

#### send your script to Ammar and Marisa

![anykey](https://media2.giphy.com/media/26BGIqWh2R1fi6JDa/giphy.gif?cid=790b76115d3627d8354c7179366b0672&rid=giphy.gif)

In [4]:
sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']]

In [5]:
bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']]

In [6]:
sales_df['Major'] = pd.to_numeric(sales_df['Major'], errors='coerce')

In [7]:
sales_df['Minor'] = pd.to_numeric(sales_df['Minor'], errors='coerce')

In [8]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])

In [10]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]

In [11]:
sales_data = sales_data.loc[sales_data['SalePrice'] > 0]

In [12]:
sales_data = sales_data.loc[sales_data['SqFtTotLiving'] > 50]

In [13]:
sales_data['PricePerSqFt'] = sales_data['SalePrice']/sales_data['SqFtTotLiving']

In [27]:
sales_data['Postal'] = sales_data.ZipCode.map(lambda x: str(x)[:5])

In [29]:
sales_data['Postal'] = pd.to_numeric(sales_data['Postal'], errors = 'coerce')

In [65]:
sales_data1.ZipCode = sales_data1.ZipCode.str[:5]

In [66]:
sales_data['ZipCode']

0          98002
1          98002
2          98002
6          98092
7          98092
11         98008
12         98008
13         98008
14         98008
16         98058
17         98058
19         98038
20         98038
21         98038
22         98038
23         98038
25           NaN
26           NaN
27           NaN
29         98058
31         98058
32         98058
33         98058
34         98188
36         98188
38         98051
39         98051
40         98001
42         98001
44         98001
           ...  
1453773    98002
1453775    98118
1453776    98177
1453778    98011
1453779    98107
1453780    98029
1453781    98002
1453782    98003
1453784    98003
1453785    98034
1453786    98028
1453787    98092
1453788    98126
1453789    98168
1453790    98003
1453791    98028
1453792    98053
1453793    98125
1453796    98117
1453798    98033
1453799    98077
1453800    98027
1453801    98075
1453802    98029
1453803    98003
1453805    98092
1453806    98155
1453809    980

In [30]:
data2019 = sales_data.loc[sales_data.DocumentDate.str[-4:] == '2019']

In [31]:
SeattleZipCodes = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98164, 98174, 98177, 98178, 98195, 98199]

In [63]:
type(SeattleZipCodes[0])

int

In [32]:
seattle2019 = data2019.loc[data2019['Postal'].isin(SeattleZipCodes)]
print(seattle2019['PricePerSqFt'].mean())

520.2328898928004


In [25]:
def is_integer(x):
    try:
        _ = int(x)
    except ValueError:
        return False
    return True

sales_data.loc[sales_data['ZipCode'].apply(is_integer) == False, 'ZipCode'].head()

13522    98031-3173
13842    98042-3001
13843    98042-3001
13844    98042-3001
13845    98042-3001
Name: ZipCode, dtype: object

In [38]:
sales_data1 = sales_data

In [43]:
sales_data1['ZipCode'].isna().count()  # note: .count() returns the total number of rows; use .sum() to count the missing values

876235

In [42]:
sales_data1.ZipCode = sales_data1.ZipCode.str[:5]

In [35]:
def is_integer(x):
    try:
        _ = int(x)
    except ValueError:
        return False
    return True

sales_data1.loc[sales_data1['ZipCode'].apply(is_integer) == False, 'ZipCode'].head()

25     NaN
26     NaN
27     NaN
237    NaN
238    NaN
Name: ZipCode, dtype: object

In [44]:
sales_data1['ZipCode'] = pd.to_numeric(sales_data1['ZipCode'])

ValueError: Unable to parse string " " at position 146480