# Getting started

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [11]:
import pandas as pd
pd.__version__

'1.5.3'

In [12]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kc_house_data.csv')

# Overview of the dataset

The dataset describes real estate units. Each unit has the properties listed below, which I retrieved using the 'columns' attribute of the DataFrame object. There are 21 columns and 21,613 entries, as the 'shape' attribute shows.

In [13]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [14]:
df.shape

(21613, 21)

The variables in the dataset are of three data types: int64, float64 and object (here, strings). This can be seen by calling the info() method on the DataFrame object.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

# Price distribution

The minimum price listed is 75,000, while the maximum price is 7,700,000. The mean price is 540,182. The most common prices are 450,000 and 350,000, each occurring 172 times and representing two modes. The median house price is 450,000.

In [16]:
df.price.min()

75000.0

In [17]:
df.price.max()

7700000.0

In [18]:
df.price.mean()

540182.1587933188

In [19]:
df.price.value_counts()

450000.0    172
350000.0    172
550000.0    159
500000.0    152
425000.0    150
           ... 
278800.0      1
439888.0      1
354901.0      1
942000.0      1
402101.0      1
Name: price, Length: 3625, dtype: int64

In [20]:
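# Sort a copy. Calling .sort() on df.price.values sorts in place and can
# reorder the DataFrame's own price column, misaligning it with the other columns.
price_list = sorted(df.price)
median_index = len(price_list) // 2
price_list[median_index]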

450000.0
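The same median can also be obtained without touching the DataFrame's underlying array. Below is a minimal sketch on a small made-up price column (the `sample` frame is hypothetical, not the house data). It compares the built-in `Series.median()` with a manual middle-element lookup on a sorted copy.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({'price': [300000.0, 75000.0, 450000.0, 1200000.0, 520000.0]})

# Built-in median; the DataFrame is left untouched.
builtin_median = sample['price'].median()

# Manual approach: sort_values() returns a new, sorted Series,
# so the original column keeps its row order.
sorted_prices = sample['price'].sort_values().to_numpy()
manual_median = sorted_prices[len(sorted_prices) // 2]

print(builtin_median, manual_median)
```

For an even number of rows the two approaches can differ: `median()` averages the two middle values, while the middle-index lookup picks the upper one.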

## Houses for millionaires

Out of curiosity, I wanted to know how many units cost 1,000,000 or more. Using vanilla Python, I was able to count them: only 1,492 of the 21,613 houses (6.9%) cost 1,000,000 or more.

In [21]:
price_list = list(df['price'])
price_ranges = {'Under 1,000,000': 0, '1,000,000 or higher': 0}
for i in range(len(price_list)):
  if price_list[i] < 1000000:
    price_ranges['Under 1,000,000'] += 1
  else:
    price_ranges['1,000,000 or higher'] += 1
price_ranges

{'Under 1,000,000': 20121, '1,000,000 or higher': 1492}
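For comparison, the same count can be sketched with a boolean mask instead of a loop. This is only an illustration on a made-up `sample` frame (hypothetical data), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({'price': [250000.0, 999999.0, 1000000.0, 3200000.0]})

# Comparing a Series with a number yields a boolean Series.
is_million_plus = sample['price'] >= 1_000_000

# True counts as 1, so sum() gives the count and mean() gives the share.
count_million_plus = int(is_million_plus.sum())
share_million_plus = float(is_million_plus.mean())

print(count_million_plus, share_million_plus)
```

Using `>=` keeps the boundary consistent with the '1,000,000 or higher' bucket used in the loop.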

# Geographical distribution

I wasn't sure how to analyze geographical data. However, a plot of longitude against latitude gives a visual picture of the properties' relative locations (the plot was suggested by Colab; I do not yet know how to produce it myself). From the plot, most properties appear to sit in a single cluster, with a few outliers. The primary cluster also appears to be split at around longitude -122.4, leaving a smaller subcluster on one side. This is probably due to a natural feature that cannot be built on, such as a river.
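One possible way to draw such a plot by hand is sketched below with matplotlib. It assumes matplotlib is installed (it is in Colab) and uses a handful of coordinate pairs in a hypothetical `coords` frame. With the real data, `df['long']` and `df['lat']` would take their place.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few coordinate pairs for illustration; with the real data, use df instead.
coords = pd.DataFrame({
    'lat': [47.5112, 47.7210, 47.7379, 47.5208, 47.6168],
    'long': [-122.257, -122.319, -122.233, -122.393, -122.045],
})

# Longitude on the x-axis and latitude on the y-axis gives a map-like view.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(coords['long'], coords['lat'], s=5, alpha=0.5)
ax.set_xlabel('long')
ax.set_ylabel('lat')
ax.set_title('Property locations')
# In a notebook, the figure is rendered at the end of the cell.
```

Small, semi-transparent markers (`s` and `alpha`) help when thousands of points overlap.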

In [22]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [23]:
df[['lat', 'long']]

           lat     long
0      47.5112 -122.257
1      47.7210 -122.319
2      47.7379 -122.233
3      47.5208 -122.393
4      47.6168 -122.045
...        ...      ...
21608  47.6993 -122.346
21609  47.5107 -122.362
21610  47.5944 -122.299
21611  47.5345 -122.069


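I noticed that the outliers generally lie east of longitude -121.6 (values greater than -121.6), so I isolated those entries and computed their mean price. This might shed light on whether they are luxury rural properties or cheaper units at the fringes of the urban center. The mean price I obtained was 236,588, less than half of the overall mean of 540,182, which suggests that units on the outskirts are generally cheaper than those in the urban core. Note that the cell output recorded below shows 590,817.65 instead. The two figures disagree, so the cell should be re-run on a freshly loaded DataFrame before drawing a firm conclusion.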

In [24]:
df[df.long > -121.6]['price'].mean()

590817.6470588235
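A comparison like this is easier to check when both groups and the group size are computed side by side. The sketch below does this on a made-up `sample` frame (hypothetical values, not the house data), using the same -121.6 threshold.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    'long': [-122.30, -122.10, -121.50, -121.40],
    'price': [600000.0, 500000.0, 250000.0, 210000.0],
})

# Boolean mask for the eastern outliers.
is_outlier = sample['long'] > -121.6

# .loc[mask, column] selects rows and column in one step.
outlier_mean = sample.loc[is_outlier, 'price'].mean()
core_mean = sample.loc[~is_outlier, 'price'].mean()
n_outliers = int(is_outlier.sum())

print(outlier_mean, core_mean, n_outliers)
```

Reporting the number of rows matters here because a mean over a handful of properties can be swayed by a single expensive one.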

# Temporal analysis

The most common construction years are recent: 2014 leads with 559 units, followed by 2006, 2005, 2004 and 2003. The least common years are 1901, 1902 and the early 1930s, each with 30 units or fewer.

In [25]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [26]:
df.yr_built.value_counts().to_frame()

      yr_built
2014       559
2006       454
2005       450
2004       433
2003       422
...        ...
1933        30
1901        29
1902        27
1935        24
1934        21
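Counts per single year do not directly show which decade dominates. One way to check, sketched here on a short made-up `years` Series (hypothetical values), is to round each year down to its decade before counting.

```python
import pandas as pd

# Hypothetical build years, for illustration only.
years = pd.Series([1900, 1905, 1987, 2003, 2004, 2006, 2014], name='yr_built')

# Integer-divide by 10 and multiply back to get the decade (2003 -> 2000).
decade = (years // 10) * 10

# Count units per decade and list the decades chronologically.
per_decade = decade.value_counts().sort_index()
busiest_decade = per_decade.idxmax()

print(per_decade)
print(busiest_decade)
```

With the real data, `df.yr_built` would replace `years`.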


The earliest units are from 1900. There are 87 such units.

In [27]:
df.yr_built.min()

1900

In [28]:
df[df.yr_built == 1900]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
14,1175000570,20150312T000000,90000.0,5,2.00,1810,4850,1.5,0,0,...,7,1810,0,1900,0,98107,47.6700,-122.394,1360,4850
115,3626039325,20141121T000000,135000.0,3,3.50,4380,6350,2.0,0,0,...,8,2780,1600,1900,1999,98117,47.6981,-122.368,1830,6350
498,9274202270,20140818T000000,180000.0,2,1.50,1490,5750,1.5,0,0,...,7,1190,300,1900,0,98116,47.5872,-122.390,1590,4025
537,5694500105,20141204T000000,185000.0,2,2.00,1510,4000,1.0,0,0,...,7,1010,500,1900,0,98103,47.6582,-122.345,1920,4000
703,7011200260,20141219T000000,195000.0,4,2.00,1400,3600,1.0,0,0,...,7,1100,300,1900,0,98119,47.6385,-122.370,1630,2048
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19063,1702901340,20140613T000000,840000.0,3,2.00,2910,6600,2.0,0,0,...,7,1920,990,1900,1988,98118,47.5576,-122.281,1370,5500
19137,3388110230,20140729T000000,850000.0,4,1.75,1790,7175,1.5,0,0,...,6,1410,380,1900,0,98168,47.4963,-122.318,1790,8417
19319,4083302225,20141014T000000,870000.0,4,3.00,2550,3784,1.5,0,0,...,8,1750,800,1900,0,98103,47.6559,-122.338,2100,4560
19385,2420069042,20150424T000000,875000.0,3,2.00,1553,6550,1.0,0,0,...,7,1553,0,1900,2001,98022,47.2056,-121.994,1010,10546


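As before, I was curious whether heritage units are more or less expensive than recent ones. The mean price I obtained for units built before 1930 was 602,613, above the overall mean of 540,182, which would mean that heritage houses are more expensive in this locality. However, the cell output recorded below shows 457,566.26, which is below the overall mean. The two figures disagree, so the cell should be re-run on a freshly loaded DataFrame before drawing a conclusion.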

In [29]:
df[df.yr_built < 1930]['price'].mean()

457566.2633371169
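To compare the two eras directly rather than against the overall mean, the rows can be labelled and grouped. This is a minimal sketch on a made-up `sample` frame. The data and the `era` labels are hypothetical; only the 1930 cut-off comes from the cell above.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    'yr_built': [1900, 1925, 1960, 2005],
    'price': [700000.0, 500000.0, 400000.0, 450000.0],
})

# Label each row by era, using the same cut-off as above.
era = (sample['yr_built'] < 1930).map(
    {True: 'before 1930', False: '1930 or later'}
).rename('era')

# Mean price and number of units per era.
summary = sample.groupby(era)['price'].agg(['mean', 'count'])

print(summary)
```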

# DataFrame from Dictionary

In [30]:
import pandas as pd

In [31]:
myDict = {
    'artist_first_name': ['Vincent', 'Henri', 'Pablo', 'Gustav'],
    'artist_last_name': ['van Gogh', 'Matisse', 'Picasso', 'Klimt'],
    'year_of_birth': [1853, 1869, 1881, 1862],
    'year_of_death': [1890, 1954, 1973, 1918],
    'nationality': ['Dutch', 'French', 'Spanish', 'Austrian'],
}
artist_df = pd.DataFrame.from_dict(myDict)

In [32]:
artist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   artist_first_name  4 non-null      object
 1   artist_last_name   4 non-null      object
 2   year_of_birth      4 non-null      int64 
 3   year_of_death      4 non-null      int64 
 4   nationality        4 non-null      object
dtypes: int64(2), object(3)
memory usage: 288.0+ bytes


In [33]:
artist_df['nationality'].value_counts()

Dutch       1
French      1
Spanish     1
Austrian    1
Name: nationality, dtype: int64

In [34]:
birth_years = list(artist_df['year_of_birth'])
death_years = list(artist_df['year_of_death'])
age_at_death = []
for i in range(len(birth_years)):
  age_at_death.append(death_years[i] - birth_years[i])
age_at_death

[37, 85, 92, 56]
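The loop above can also be written as a single column subtraction, since pandas applies arithmetic element by element. Below is a sketch that rebuilds only the two year columns in a new `artists` frame (the `age_at_death` column name is my own choice).

```python
import pandas as pd

artists = pd.DataFrame({
    'year_of_birth': [1853, 1869, 1881, 1862],
    'year_of_death': [1890, 1954, 1973, 1918],
})

# Subtracting two Series subtracts them row by row.
artists['age_at_death'] = artists['year_of_death'] - artists['year_of_birth']
ages = artists['age_at_death'].tolist()

print(ages)
```

Because only years are stored, this difference can overstate the actual age by one year when the person died before their birthday in that year.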

# Pandas Documentation and Data Types

The main data types we used in class were DataFrame and Series. Different methods return different value types, which may also include NumPy arrays (ndarray).

For example, in the following snippet, we access the 'zipcode' column of the DataFrame, which returns a Series. The unique() method is called on the Series to return an array (ndarray).

In [35]:
df['zipcode'].unique()

array([98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, 98146,
       98038, 98007, 98115, 98107, 98126, 98019, 98103, 98002, 98133,
       98040, 98092, 98030, 98119, 98112, 98052, 98027, 98117, 98058,
       98001, 98056, 98166, 98023, 98070, 98148, 98105, 98042, 98008,
       98059, 98122, 98144, 98004, 98005, 98034, 98075, 98116, 98010,
       98118, 98199, 98032, 98045, 98102, 98077, 98108, 98168, 98177,
       98065, 98029, 98006, 98109, 98022, 98033, 98155, 98024, 98011,
       98031, 98106, 98072, 98188, 98014, 98055, 98039])

Additionally, I combined pandas with Python's built-in len() function to determine how many unique zipcode values there are.

In [36]:
len(df['zipcode'].unique())

70
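pandas also has a method for this count, `Series.nunique()`. The sketch below compares both approaches on a short made-up `zipcodes` Series (hypothetical values).

```python
import pandas as pd

# Hypothetical zipcodes with repeats, for illustration only.
zipcodes = pd.Series([98178, 98125, 98028, 98178, 98125], name='zipcode')

# unique() returns an ndarray of distinct values; len() counts them.
n_unique_len = len(zipcodes.unique())

# nunique() returns the count directly.
n_unique = zipcodes.nunique()

print(n_unique_len, n_unique)
```

The two can differ when values are missing: `nunique()` skips NaN by default, while `unique()` includes it in the returned array.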

The most interesting thing I learned from the pandas documentation is how to convert a Series to a DataFrame with the to_frame() method. It also helped me answer a question raised in class about the difference between df['yr_built'], which returns a Series, and df[['yr_built']], which returns a DataFrame. This is the first time I have encountered Series and DataFrames, so I am still discovering how they can be used differently, but the question had been bugging me and I am glad I stumbled across the answer. Below, I use the value_counts() method to retrieve a Series, which I then convert into a DataFrame. I am still not sure how useful this is!

In [37]:
df.yr_built.value_counts()

2014    559
2006    454
2005    450
2004    433
2003    422
       ... 
1933     30
1901     29
1902     27
1935     24
1934     21
Name: yr_built, Length: 116, dtype: int64

In [38]:
df.yr_built.value_counts().to_frame()

      yr_built
2014       559
2006       454
2005       450
2004       433
2003       422
...        ...
1933        30
1901        29
1902        27
1935        24
1934        21
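The difference between single and double brackets can be checked directly by inspecting the returned types. Below is a minimal sketch on a made-up `sample` frame (hypothetical values).

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({'yr_built': [2014, 2006, 2014, 1900]})

as_series = sample['yr_built']     # single brackets: one column as a Series
as_frame = sample[['yr_built']]    # double brackets: a one-column DataFrame

# value_counts() returns a Series; to_frame() wraps it in a DataFrame.
counts = sample['yr_built'].value_counts()
counts_frame = counts.to_frame()

print(type(as_series).__name__, type(as_frame).__name__)
print(type(counts).__name__, type(counts_frame).__name__)
```

One practical difference is that a DataFrame is two-dimensional and can hold further columns next to the counts. The column label that `to_frame()` produces here depends on the pandas version: under 1.5.3 it is 'yr_built', as in the output above.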
