<h1 style="color: #00BFFF;">00 |</h1>

In [1]:
# 📚 Basic libraries
import pandas as pd # data manipulation
import numpy as np # numerical operations
import os # file management
import matplotlib.pyplot as plt # 2D visualizations
import seaborn as sns # statistical visualizations built on matplotlib
import warnings # warning messages management
import datetime # to play with dates

# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
warnings.filterwarnings('ignore') # ignore warnings

In [2]:
# Basic functions

def snake_columns(data): # rename columns to snake_case (modifies data in place)
    data.columns = [column.lower().replace(' ', '_') for column in data.columns]
    
def open_data(data): # returns shape, data types & shows a small sample
    print(f"Data shape is {data.shape}.")
    print()
    print(data.dtypes)
    print()
    print("Data row sample and full columns:")
    return data.sample(5)

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In [3]:
file_path = os.path.join("C:/Users/apisi/01. IronData/01. GitHub/03. Projects/05_patern_pending/00_data", "regression_data.xls")
data = pd.read_excel(file_path, index_col=0) # use the first column as the index (avoids a stray `Unnamed: 0` column)

<h2 style="color: #008080;">Data Copy</h2>

In [4]:
datac = data.copy() # work on a copy so the original data stays untouched (best practice)
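
The copy matters because `snake_columns` renames columns in place. The following is a minimal sketch of that behaviour, assuming a hypothetical two-row frame (`raw` and `work` are illustrative names, not part of this notebook):

```python
import pandas as pd

def snake_columns(data):  # same helper as defined earlier in this notebook
    data.columns = [column.lower().replace(' ', '_') for column in data.columns]

# Hypothetical stand-in for the raw data (values are illustrative only)
raw = pd.DataFrame({"Sqft Living": [2000, 1620], "Price": [402000, 510000]})

work = raw.copy()          # DataFrame.copy() makes a deep copy by default
snake_columns(work)        # renames the columns of `work` only
work.loc[0, "price"] = 0   # changes a value in the copy only

print(list(raw.columns))   # original column names are preserved
print(list(work.columns))  # snake_case names on the copy
print(raw.loc[0, "Price"]) # original value is preserved
```

Without the `.copy()` call, `work` and `raw` would be the same object, so the rename would also affect the raw data.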

In [5]:
snake_columns(datac)
open_data(datac)

Data shape is (21597, 20).

date             datetime64[ns]
bedrooms                  int64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront                int64
view                      int64
condition                 int64
grade                     int64
sqft_above                int64
sqft_basement             int64
yr_built                  int64
yr_renovated              int64
zipcode                   int64
lat                     float64
long                    float64
sqft_living15             int64
sqft_lot15                int64
price                     int64
dtype: object

Data row sample and full columns:


id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
7518505070,2014-06-25,4,2.25,2000,3672,2.0,0,0,5,7,1650,350,1926,0,98117,47.6769,-122.383,2000,5100,402000
7715801030,2015-03-31,4,2.5,1620,8125,2.0,0,0,4,7,1620,0,1983,0,98074,47.6255,-122.059,1480,8120,510000
6131600255,2014-12-22,3,2.0,1540,8316,1.0,0,0,5,6,1540,0,1954,0,98002,47.323,-122.216,1250,8316,202500
1421079007,2015-03-24,3,2.75,2480,209199,1.5,0,0,3,8,1870,610,2000,0,98010,47.3085,-121.888,2040,219229,408506
567000020,2015-04-28,2,1.0,1570,5000,1.5,0,3,4,8,1570,0,1924,0,98144,47.5955,-122.294,1760,3000,800000


<blockquote style="background-color: #d4edda; color: #155724; border-color: #c3e6cb; padding: 10px; border-radius: 5px;">
    
**First impression:**
    
_____________

💻 This dataset covers **one year** (May 2014 to May 2015) of house sale prices for King County, which includes Seattle. It has 21 fields (the `id` index plus 20 columns):
  
<table border="1">
  <tr>
    <th>Column Name</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>id</td>
    <td>Unique identifier for a house</td>
  </tr>
  <tr>
    <td>date</td>
    <td>Date the house was sold</td>
  </tr>
  <tr>
    <td>price</td>
    <td>Sale price (the prediction target)</td>
  </tr>
  <tr>
    <td>bedrooms</td>
    <td>Number of bedrooms in the house</td>
  </tr>
  <tr>
    <td>bathrooms</td>
    <td>Number of bathrooms in the house (fractional values such as 2.25 denote partial bathrooms)</td>
  </tr>
  <tr>
    <td>sqft_living</td>
    <td>Square footage of the home</td>
  </tr>
  <tr>
    <td>sqft_lot</td>
    <td>Square footage of the lot</td>
  </tr>
  <tr>
    <td>floors</td>
    <td>Total floors (levels) in the house</td>
  </tr>
  <tr>
    <td>waterfront</td>
    <td>Whether the house has a view of the waterfront (1 = yes, 0 = no)</td>
  </tr>
  <tr>
    <td>view</td>
    <td>Rating of the view from the property (integer scale, 0 = no notable view)</td>
  </tr>
  <tr>
    <td>condition</td>
    <td>Overall condition of the house</td>
  </tr>
  <tr>
    <td>grade</td>
    <td>Overall grade given to the housing unit, based on the King County grading system</td>
  </tr>
  <tr>
    <td>sqft_above</td>
    <td>Square footage of the house excluding the basement</td>
  </tr>
  <tr>
    <td>sqft_basement</td>
    <td>Square footage of the basement</td>
  </tr>
  <tr>
    <td>yr_built</td>
    <td>Year the house was built</td>
  </tr>
  <tr>
    <td>yr_renovated</td>
    <td>Year the house was renovated (0 when no renovation is recorded)</td>
  </tr>
  <tr>
    <td>zipcode</td>
    <td>Zip code</td>
  </tr>
  <tr>
    <td>lat</td>
    <td>Latitude coordinate</td>
  </tr>
  <tr>
    <td>long</td>
    <td>Longitude coordinate</td>
  </tr>
  <tr>
    <td>sqft_living15</td>
    <td>Average living-space square footage of the 15 nearest neighboring houses</td>
  </tr>
  <tr>
    <td>sqft_lot15</td>
    <td>Average lot square footage of the 15 nearest neighboring houses</td>
  </tr>
</table>
    
* --> **Target variable**: `price`.
* --> **Features**: the remaining 19 columns.

_____________
</blockquote>
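
The target/feature distinction above could be expressed in code roughly as follows. This is only a sketch on a three-row stand-in built from sample rows shown earlier. The names `sample`, `X` and `y` are assumptions for illustration, and the notebook itself does not perform this split at this stage.

```python
import pandas as pd

# Three-row stand-in using a few of the columns listed above
sample = pd.DataFrame(
    {
        "bedrooms": [4, 4, 3],
        "sqft_living": [2000, 1620, 1540],
        "grade": [7, 7, 6],
        "price": [402000, 510000, 202500],
    }
)

y = sample["price"]               # target variable
X = sample.drop(columns="price")  # every remaining column is a feature

print(X.shape, y.shape)
```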

In [6]:
# Moving on to Data Cleaning >
datac.to_csv(os.path.join(os.path.dirname(file_path), "datac.csv")) # save next to the source file
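
One caveat when moving on to the next notebook: CSV does not store dtypes or the index role, so `date` and `id` have to be restored when the file is read back. A minimal sketch, assuming an in-memory buffer in place of the real file and a hypothetical two-row frame taken from the sample above:

```python
import io
import pandas as pd

# Two-row stand-in with the same index name and date dtype as datac
sample = pd.DataFrame(
    {"date": pd.to_datetime(["2014-06-25", "2015-03-31"]), "price": [402000, 510000]},
    index=pd.Index([7518505070, 7715801030], name="id"),
)

buffer = io.StringIO()  # stands in for datac.csv
sample.to_csv(buffer)

buffer.seek(0)
naive = pd.read_csv(buffer)  # id becomes a regular column, date is read as text

buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id", parse_dates=["date"])  # restores index and dates

print(naive.columns.tolist())
print(restored.index.name)
```

Passing `index_col` and `parse_dates` to `pd.read_csv` keeps the reloaded frame consistent with the one saved here.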