Roshni Scratch Notebook

In [2]:
# Import Packages and Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import statsmodels.api as sm
import seaborn as sns
from scipy import stats

%matplotlib inline

data = pd.read_csv('data/kc_house_data.csv')

# Preliminary Exploration of Data

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30155 entries, 0 to 30154
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             30155 non-null  int64  
 1   date           30155 non-null  object 
 2   price          30155 non-null  float64
 3   bedrooms       30155 non-null  int64  
 4   bathrooms      30155 non-null  float64
 5   sqft_living    30155 non-null  int64  
 6   sqft_lot       30155 non-null  int64  
 7   floors         30155 non-null  float64
 8   waterfront     30155 non-null  object 
 9   greenbelt      30155 non-null  object 
 10  nuisance       30155 non-null  object 
 11  view           30155 non-null  object 
 12  condition      30155 non-null  object 
 13  grade          30155 non-null  object 
 14  heat_source    30123 non-null  object 
 15  sewer_system   30141 non-null  object 
 16  sqft_above     30155 non-null  int64  
 17  sqft_basement  30155 non-null  int64  
 18  sqft_g

* Definitely unnecessary columns: id, date (sale date)
* Maybe unnecessary: address, lat, long
* Maybe not that interesting: heat_source, sewer_system, grade
* Sub-variables: sqft_above, sqft_basement


## Variable Breakdown: 

### Outcome:
* price (continuous)
***

### Continuous Predictors: 
1. Sqft_living
2. Sqft_lot
3. Sqft_garage
4. Year Built (1900 - 2021; range = 121)


### Categorical Predictors:
#### Categorical Ordinal Predictors: 
1. Bedrooms (13 categories; can combine 6 - 13 as 6+ for 7 categories)
2. Bathrooms (21 categories; can combine 4+ into one category for 7 categories)
3. Floors (7 categories; can combine 3+ for 5 categories)
4. View (5 categories; could be collapsed to binary)
5. Condition (5 categories)

#### Categorical Binary Predictors:
1. Waterfront (discrete: yes/no)
2. Greenbelt (discrete: yes/no)
3. Nuisance (discrete: yes/no)
4. Year Renovated (NEED TO CONVERT, discrete: yes/no)
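
A minimal sketch of how the yes/no predictors above could be recoded as 0/1 for modeling. The toy rows and the 'YES'/'NO' labels are assumptions for illustration; the actual labels should be checked with `value_counts()` before applying this to `data`.

```python
import pandas as pd

# Toy rows standing in for the real data. The 'YES'/'NO' labels are an
# assumption; confirm them with value_counts() on the real columns.
toy = pd.DataFrame({
    'waterfront': ['NO', 'YES', 'NO'],
    'greenbelt': ['NO', 'NO', 'YES'],
    'nuisance': ['YES', 'NO', 'NO'],
})

binary_cols = ['waterfront', 'greenbelt', 'nuisance']
for col in binary_cols:
    # 1 where the label is 'YES' (case-insensitive), 0 otherwise
    toy[col] = (toy[col].str.upper() == 'YES').astype(int)

print(toy)
```

Comparing against 'YES' rather than mapping a dictionary means any unexpected label silently becomes 0, so it is worth checking for stray values first.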

In [45]:
df = data.drop(columns=['id', 'date', 'address', 'lat', 'long', 'heat_source', 'sewer_system', 'sqft_above', 'sqft_basement'])

In [48]:
df.bedrooms.value_counts()

3     12754
4      9597
2      3936
5      2798
6       498
1       391
7        80
0        44
8        38
9        14
10        3
13        1
11        1
Name: bedrooms, dtype: int64
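
Given how few homes have 6 or more bedrooms, one possible way to fold them into a single "6+" level is sketched below. It uses a small toy Series rather than `df`, and the names `bedrooms_capped` and `bedroom_labels` are hypothetical.

```python
import pandas as pd

# Toy values covering the range seen in the counts above (0 through 13)
bedrooms = pd.Series([3, 4, 2, 6, 7, 13, 0, 1], name='bedrooms')

# Cap everything above 6 at 6, so the value 6 now means "6 or more"
bedrooms_capped = bedrooms.clip(upper=6)

# Optional string labels that make the top bucket explicit
bedroom_labels = bedrooms_capped.map(lambda n: '6+' if n == 6 else str(n))

print(bedroom_labels.value_counts())
```

Keeping the capped integer version preserves the ordering for an ordinal treatment, while the string labels are more convenient for dummy coding. The 0-bedroom rows are left untouched here and may need a separate decision.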

***

# Data Cleaning Steps:

## 1. Create new variables:
 * a. sqft_unused (lot - living), drop sqft_lot
 * b. basement (binary - y/n)
 * c. year renovated (binary - y/n)
 * d. bathroom : bedroom ratio
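
The four new variables could be built roughly as follows. This is a sketch on toy rows, not the final cleaning code. It assumes that "unused" means lot area minus living area, that a renovation year of 0 means never renovated, and that the renovation column is named `yr_renovated` (that name does not appear in the output above). Since `sqft_basement` is dropped when `df` is created, the basement flag would need to be derived from `data` or before that drop.

```python
import numpy as np
import pandas as pd

# Toy rows; yr_renovated is an assumed column name, with 0 meaning "never"
toy = pd.DataFrame({
    'sqft_living': [1500, 2400, 900],
    'sqft_lot': [6000, 7200, 900],
    'sqft_basement': [0, 800, 0],
    'yr_renovated': [0, 1995, 0],
    'bedrooms': [3, 4, 0],
    'bathrooms': [2.0, 2.5, 1.0],
})

# a. lot area not covered by living area (can go negative for multi-story homes)
toy['sqft_unused'] = toy['sqft_lot'] - toy['sqft_living']

# b. basement present: 1 if any basement square footage, else 0
toy['basement'] = (toy['sqft_basement'] > 0).astype(int)

# c. renovated: 1 if a renovation year is recorded, else 0
toy['renovated'] = (toy['yr_renovated'] > 0).astype(int)

# d. bathroom-to-bedroom ratio; 0 bedrooms becomes NaN to avoid dividing by zero
toy['bath_bed_ratio'] = toy['bathrooms'] / toy['bedrooms'].replace(0, np.nan)

# Drop the source columns that the new variables replace
toy = toy.drop(columns=['sqft_lot', 'sqft_basement', 'yr_renovated'])
print(toy)
```

The bedroom counts include 44 homes with 0 bedrooms, so the ratio needs some rule for those rows; leaving them as NaN here just defers that choice.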

## 2. Condense variables:
 * a. bedrooms - 