# Load, Clean, and Perform Preliminary EDA on Housing Data

## Imports

In [1]:
# (this is not an exhaustive list of libraries)
import pandas as pd
# import numpy as np
# import os
# import json
# from pprint import pprint
# from functions_variables import encode_tags

from modules.clean_data import get_nearly_empty_columns

## Load Data

**Delete:**

- The os module has a perfect method to list files in a directory.
- Pandas json normalize could work here but is not necessary to convert the JSON data to a dataframe.
- You may need a nested for-loop to access each sale!
- We've put a lot of time into creating the structure of this repository, and it's a good example for future projects.  In the file functions_variables.py, there is an example function that you can import and use.  If you have any variables, functions or classes that you want to make, they can be put in the functions_variables.py file and imported into a notebook.  Note that only .py files can be imported into a notebook. If you want to import everything from a .py file, you can use the following:
```python
from functions_variables import *
```
If you just `import functions_variables`, then each object from the file will need to be prefixed with `functions_variables.` when you use it.
Using this .py file will keep your notebooks organized and make it easier to reuse code between notebooks.

The Python script `modules/load_data.py` was used to load housing data from the JSON files in `data/raw`, generate a DataFrame, and export the DataFrame to a CSV file, `data/processed/housing_data_0.csv`. The script loops through all JSON files, parsing them for information concerning a house's sale price, sale date, description, location, tags, and flags.
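
`modules/load_data.py` is not reproduced in this notebook. As a rough sketch of the kind of loop it describes, the example below assumes each JSON file keeps its sales in a top-level `results` list; that layout, the function name `load_sales_sketch`, and the toy files are all hypothetical, so adjust the keys to match the real files.

```python
import json
import os
import tempfile

import pandas as pd


def load_sales_sketch(raw_dir):
    """Collect one flat record per sale from every JSON file in raw_dir."""
    records = []
    for name in sorted(os.listdir(raw_dir)):
        if not name.endswith('.json'):
            continue
        with open(os.path.join(raw_dir, name)) as f:
            payload = json.load(f)
        # ASSUMPTION: sales live under a top-level "results" list.
        for sale in payload.get('results', []):
            records.append(sale)
    # json_normalize flattens nested dicts into dotted column names.
    return pd.json_normalize(records)


# Tiny made-up files, only to show the function running end to end.
with tempfile.TemporaryDirectory() as raw_dir:
    for i, price in enumerate([100000, 250000]):
        sale = {'sold_price': price, 'location': {'city': 'A'}}
        with open(os.path.join(raw_dir, f'file_{i}.json'), 'w') as f:
            json.dump({'results': [sale]}, f)
    demo = load_sales_sketch(raw_dir)

print(demo.shape)
```

The outer loop walks the files and the inner loop walks the sales in each file, which matches the nested for-loop hinted at above.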

Get started by loading `data/processed/housing_data_0.csv`.

In [2]:
dirname = '../data/processed/'
basename = 'housing_data_0.csv'
filename = dirname + basename
df = pd.read_csv(filename, sep=',')
df.shape


(8159, 193)

## Clean Data

At this point, ensure that you have all sales in a dataframe.
- Is each cell one value, or do some cells have lists?
- Maybe the "tags" will help create some features.
- What are the data types of each column?
- Some sales may not actually include the sale price.  These rows should be dropped.
- Some sales don't include the property type.
- There are a lot of None values.  Should these be dropped or replaced with something?

The loaded housing DataFrame has 193 columns, many of which are mostly empty. Some columns are irrelevant or redundant, and some rows are missing a sale price. In this section, we formulate and implement data cleaning steps.

### Identify Nearly-Empty Columns

In [3]:
cols_to_drop = []  # a running list of columns to drop

We identify columns that are at least 95% empty and flag them to be dropped.

In [4]:
cols_to_drop.extend(get_nearly_empty_columns(df))
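
The body of `get_nearly_empty_columns` lives in `modules/clean_data.py` and is not shown here. One way such a helper could work, assuming "nearly empty" means that at least 95% of a column's values are missing, is sketched below; the name `nearly_empty_columns_sketch`, the `threshold` parameter, and the toy data are illustrative only.

```python
import numpy as np
import pandas as pd


def nearly_empty_columns_sketch(frame, threshold=0.95):
    """Return names of columns whose share of missing values is >= threshold."""
    missing_fraction = frame.isna().mean()
    return missing_fraction[missing_fraction >= threshold].index.to_list()


toy = pd.DataFrame({
    'mostly_missing': [np.nan] * 39 + [1.0],        # 97.5% missing
    'half_missing': [np.nan] * 20 + [1.0] * 20,     # 50% missing
    'complete': range(40),                          # 0% missing
})
flagged = nearly_empty_columns_sketch(toy)
print(flagged)
```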

### Identify Irrelevant/Redundant Columns

We identify columns that are irrelevant to the sale price of a house and flag them to be dropped.

In [5]:
cols_to_drop.extend(['property_id', 'listing_id'])

There are several columns that contain redundant information. We flag them to be dropped.

In [6]:
print(df['status'].nunique())
cols_to_drop.append('status')

1


In [7]:
print(df['is_new_listing'].fillna(False).nunique())
cols_to_drop.append('is_new_listing')

1


In [8]:
cols = ['baths_3qtr', 'baths_full', 'baths_half', 'baths', 'ensuite'] 
print(df[cols].sample(5))

cols_to_drop.extend(['baths_3qtr', 'baths_full', 'baths_half'])

      baths_3qtr  baths_full  baths_half  baths ensuite
6873         NaN         6.0         NaN    6.0     NaN
4542         NaN         1.0         NaN    1.0     NaN
6630         NaN         2.0         NaN    2.0     NaN
3837         NaN         2.0         NaN    2.0     NaN
4503         NaN         2.0         NaN    2.0     NaN


In [9]:
cols = ['garage', 'garage_1_or_more', 'garage_2_or_more', 'garage_3_or_more', 'carport']
print(df[cols].sample(5))

cols_to_drop.extend(['garage_1_or_more', 'garage_2_or_more', 'garage_3_or_more'])

      garage garage_1_or_more garage_2_or_more garage_3_or_more carport
1194     2.0             True             True              NaN     NaN
5587     NaN              NaN              NaN              NaN     NaN
540      NaN              NaN              NaN              NaN     NaN
7003     NaN              NaN              NaN              NaN     NaN
5046     NaN              NaN              NaN              NaN     NaN


In [10]:
cols = ['stories', 'single_story', 'two_or_more_stories']
print(df[cols].sample(5))

cols_to_drop.extend(['single_story', 'two_or_more_stories'])

      stories single_story two_or_more_stories
2298      1.0         True                 NaN
3888      NaN          NaN                 NaN
4283      2.0          NaN                True
2262      1.0         True                 NaN
5179      2.0          NaN                True


In [11]:
cols = ['type', 'sub_type']
print(df[cols].sample(5))

cols_to_drop.append('sub_type')

               type   sub_type
1416  single_family        NaN
4204  single_family        NaN
3818           land        NaN
775          condos      condo
3645      townhomes  townhouse


In [12]:
cols = ['price_reduced_amount', 'is_price_reduced']
print(df[cols].sample(5))

cols_to_drop.append('is_price_reduced')

We consider geographical data.

In [13]:
cols = ['city', 'state', 'postal_code', 'lat', 'lon']
df[cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   city         8154 non-null   object 
 1   state        8159 non-null   object 
 2   postal_code  8159 non-null   int64  
 3   lat          7909 non-null   float64
 4   lon          7909 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 318.8+ KB


We keep the `city` and `postal_code` columns. 

In [14]:
cols_to_drop.extend(['state', 'lat', 'lon'])

In [15]:
cols = [
    'views', 'view', 'hill_or_mountain_view', 'city_view',
    'big_yard', 'fenced_yard', 'front_porch',
    'groundscare', 'farm', 'ranch', 
]
print(df[cols].fillna(False).sample(5))

cols_to_drop.extend([
    'views', 'hill_or_mountain_view', 
    'city_view', 'big_yard'
])

      views   view  hill_or_mountain_view  city_view  big_yard  fenced_yard  \
376   False  False                  False      False     False        False   
1883  False   True                  False       True     False        False   
4220   True   True                  False      False     False        False   
3805   True   True                   True      False     False        False   
5517  False  False                  False      False     False        False   

      front_porch  groundscare   farm  ranch  
376         False        False  False  False  
1883        False        False  False  False  
4220        False        False  False  False  
3805        False        False  False  False  
5517        False        False  False  False  


#### Drop Irrelevant/Redundant Columns

In [16]:
df = df.drop(columns=cols_to_drop)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 50 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   list_date                    7752 non-null   object 
 1   price_reduced_amount         2484 non-null   float64
 2   year_built                   7316 non-null   float64
 3   sold_date                    8159 non-null   object 
 4   sold_price                   6716 non-null   float64
 5   lot_sqft                     6991 non-null   float64
 6   sqft                         7323 non-null   float64
 7   baths                        7980 non-null   float64
 8   garage                       4448 non-null   float64
 9   stories                      6260 non-null   float64
 10  beds                         7504 non-null   float64
 11  type                         8125 non-null   object 
 12  postal_code                  8159 non-null   int64  
 13  city              

### Identify Rows Without a Sale Price

Of the 8,159 rows, 1,443 are missing a sale price (`sold_price` has only 6,716 non-null values). Because sale price will serve as the target in the upcoming machine learning analysis, we flag these rows to be dropped.

In [22]:
# Boolean mask of rows with no sale price (avoids shadowing the built-in `filter`)
missing_price = df['sold_price'].isna()
rows_to_drop = df.loc[missing_price, :].index.to_list()

In [23]:
df.drop(index=rows_to_drop).info()

<class 'pandas.core.frame.DataFrame'>
Index: 6716 entries, 0 to 8158
Data columns (total 50 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   list_date                    6429 non-null   object 
 1   price_reduced_amount         2010 non-null   float64
 2   year_built                   6041 non-null   float64
 3   sold_date                    6716 non-null   object 
 4   sold_price                   6716 non-null   float64
 5   lot_sqft                     5794 non-null   float64
 6   sqft                         6061 non-null   float64
 7   baths                        6566 non-null   float64
 8   garage                       3485 non-null   float64
 9   stories                      5099 non-null   float64
 10  beds                         6223 non-null   float64
 11  type                         6696 non-null   object 
 12  postal_code                  6716 non-null   int64  
 13  city                   

### Identify Imputation Values

### Identify Suitable dtypes

### Implement Cleaning

## Feature Engineering

Consider the fact that with tags, there are a lot of categorical variables.
- How many columns would we have if we one-hot encode (OHE) tags, city and state?
- Perhaps we can get rid of tags that have a low frequency.
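
To make the two questions above concrete, here is a small sketch on made-up data. It assumes tag columns hold `True` or a missing value, as the samples earlier in this notebook suggest; the column names and the 50% threshold are purely illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'city': ['A', 'A', 'B', 'C'],
    'type': ['condos', 'land', 'condos', 'condos'],
    # Tag columns: True when the tag is present, missing otherwise.
    'fireplace': [True, True, None, True],
    'rare_tag': [None, None, None, True],
})

# One-hot encode the categorical columns and count the resulting columns.
encoded = pd.get_dummies(toy, columns=['city', 'type'])
print(encoded.shape[1])

# Flag tags that are present in fewer than `min_share` of the rows.
tag_cols = ['fireplace', 'rare_tag']
min_share = 0.5  # illustrative threshold, not a recommendation
tag_share = toy[tag_cols].notna().mean()
rare_tags = tag_share[tag_share < min_share].index.to_list()
print(rare_tags)
```

Every distinct city adds one column, so checking `df['city'].nunique()` before encoding gives a quick estimate of how wide the result will be.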

In [17]:
# OHE categorical variables here
# tags will have to be done manually

- Sales will vary drastically between cities and states.  Is there a way to keep information about which city it is without OHE, such as using a measure of central tendency (e.g., the mean or median sale price per city)?
- Could we label encode or ordinal encode?  Yes, but this may have undesirable effects, giving nominal data ordinal values.
- If you replace cities or states with numerical values, split the data first so that information from the test set does not leak into the training set. This is a great time to do a train-test split. Compute the values on the training data, then join them to the test data.
- Drop columns that aren't needed.
- Don't keep the list price because it will be too close to the sale price.
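
The split-then-encode idea in the list above can be sketched as follows. This is a minimal illustration on made-up prices, not a prescribed method: it uses a plain pandas split in place of a dedicated train-test-split utility, and the column name `city_mean_price` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'sold_price': [100.0, 120.0, 110.0, 300.0, 320.0, 310.0, 200.0, 220.0],
})

# Split first, so that test-set prices never influence the encoding.
train = toy.sample(frac=0.75, random_state=0)
test = toy.drop(index=train.index)

# Mean sale price per city, computed on the training rows only.
city_means = train.groupby('city')['sold_price'].mean()
global_mean = train['sold_price'].mean()

# Map the training statistics onto both sets; a city that never appears
# in training falls back to the overall training mean.
train = train.assign(city_mean_price=train['city'].map(city_means))
test = test.assign(
    city_mean_price=test['city'].map(city_means).fillna(global_mean)
)
print(test)
```

The fallback matters in practice: a city with only a handful of sales can easily end up entirely in the test set.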

In [18]:
# perform train test split here
# do something with state and city
# drop any other not needed columns

**STRETCH**

- You're not limited to the data provided to you. Think about, or research, other features that might be useful for predicting housing prices.
- Can you import and join this data? Make sure you do any necessary preprocessing and make sure it is joined correctly.
- Example suggestion: could mortgage interest rates in the year of the listing affect the price? 
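
As a sketch of what such a join could look like, the example below merges a yearly lookup table onto sales by listing year. The rate values are placeholders, not real mortgage rates, and the `rates` table and its column names are assumptions for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'list_date': ['2023-05-01', '2024-01-15', '2024-03-10'],
    'sold_price': [300000.0, 350000.0, 410000.0],
})
# Placeholder numbers, NOT real mortgage rates.
rates = pd.DataFrame({'year': [2023, 2024], 'mortgage_rate': [1.0, 2.0]})

# Derive the join key, then left-join so that no sale is dropped.
sales['year'] = pd.to_datetime(sales['list_date']).dt.year
merged = sales.merge(rates, on='year', how='left', validate='many_to_one')
print(merged)
```

`validate='many_to_one'` raises an error if the lookup table contains duplicate years, which is one way to confirm the join did not silently multiply rows.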

In [19]:
# import, join and preprocess new data here

Remember all of the EDA that you've been learning about?  Now is a perfect time for it!
- Look at distributions of numerical variables to see the shape of the data and detect outliers.
- Scatterplots of a numerical variable and the target go a long way to show correlations.
- A heatmap will help detect highly correlated features, and we don't want these.
- Is there any overlap in any of the features? (redundant information, like number of this or that room...)
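
For the correlation check in particular, a heatmap is one option; a plain table of highly correlated pairs is another. The sketch below uses synthetic data and an arbitrary 0.8 cutoff, both chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data in which `beds` is built to track `sqft` closely.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
toy = pd.DataFrame({
    'sqft': sqft,
    'beds': np.round(sqft / 800 + rng.normal(0, 0.1, size=200)),
    'year_built': rng.integers(1900, 2020, size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]  # illustrative cutoff
print(high_pairs)
```

The same `corr` matrix can be passed to a plotting function such as `seaborn.heatmap` if a visual overview is preferred.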

In [20]:
# perform EDA here

Now is a great time to scale the data and save it once it's preprocessed.
- You can save it in your data folder, but you may want to make a new `processed/` subfolder to keep it organized.
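
A minimal sketch of scaling and saving, assuming the data has already been split into training and test sets. It standardizes by hand with pandas; a library scaler would work equally well. The toy frames and the file name `train_scaled.csv` are hypothetical, and a temporary folder stands in for the real output directory.

```python
import os
import tempfile

import pandas as pd

train = pd.DataFrame({'sqft': [1000.0, 2000.0, 3000.0], 'beds': [2.0, 3.0, 4.0]})
test = pd.DataFrame({'sqft': [1500.0], 'beds': [3.0]})

# Standardize with statistics from the training set only.
mean, std = train.mean(), train.std(ddof=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# A temporary folder stands in for a processed-data directory here.
with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, 'train_scaled.csv')
    train_scaled.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.round(3))
```

Reusing the training mean and standard deviation for the test set keeps the two sets on the same scale without letting test data influence the transformation.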