# About Dataset
### Context

This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and sits at a comfortable size between too toyish and too cumbersome.

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.
### Content

The data pertain to the houses found in a given California district, along with some summary statistics about them based on the 1990 census data. Be warned: the data aren't cleaned, so some preprocessing steps are required! The columns are as follows; their names are largely self-explanatory:

- longitude

- latitude

- housing_median_age

- total_rooms

- total_bedrooms

- population

- households

- median_income

- median_house_value

- ocean_proximity
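
Because the raw data need some preprocessing, a quick first step is to confirm that a loaded frame has the columns listed above and to count missing values per column. The following is a minimal sketch under stated assumptions: the helper name `summarize_columns` and the two-row `toy` frame are hypothetical, and the values in `toy` are made up for illustration, not taken from the census data.

```python
import pandas as pd

# Column names exactly as listed above.
EXPECTED_COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value", "ocean_proximity",
]


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count of each expected column."""
    absent = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if absent:
        raise KeyError(f"missing expected columns: {absent}")
    subset = frame[EXPECTED_COLUMNS]
    return pd.DataFrame({
        "dtype": subset.dtypes.astype(str),
        "n_missing": subset.isna().sum(),
    })


# Hypothetical two-row frame; the values are invented for illustration.
toy = pd.DataFrame({
    "longitude": [-122.0, -118.0],
    "latitude": [37.5, 34.0],
    "housing_median_age": [30, 15],
    "total_rooms": [1000, 2500],
    "total_bedrooms": [200.0, None],  # one deliberately missing value
    "population": [600, 1400],
    "households": [180, 450],
    "median_income": [4.5, 3.2],
    "median_house_value": [250000.0, 180000.0],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, the same helper could be called on the frame returned by `pd.read_csv` to see which columns need imputation or type fixes.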

### About the Features

1. `longitude`: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west

2. `latitude`: A measure of how far north a house is; a higher value is farther north

3. `housing_median_age`: Median age of a house within a block; a lower number is a newer building

4. `total_rooms`: Total number of rooms within a block

5. `total_bedrooms`: Total number of bedrooms within a block

6. `population`: Total number of people residing within a block

7. `households`: Total number of households within a block, where a household is a group of people residing within a home unit

8. `median_income`: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

9. `median_house_value`: Median house value for households within a block (measured in US Dollars)

10. `ocean_proximity`: Location of the house relative to the ocean

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
df = pd.read_csv("Cleaned_Dataset.csv")

In [None]:
# Work on a copy so the loaded DataFrame stays unchanged
df_copy = df.copy()

## Feature Engineering

In [None]:
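# Encode each ocean_proximity label as an integer code, one row at a time.
# Assumes the default 0..n-1 index, since rows are addressed with .loc[i, ...].
for i in range(len(df_copy["ocean_proximity"])):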
    if df_copy.loc[i,"ocean_proximity"]=="NEAR BAY":
        df_copy.loc[i,"ocean_proximity"]=0
    elif df_copy.loc[i,"ocean_proximity"]=="<1H OCEAN":
        df_copy.loc[i,"ocean_proximity"]=1
    elif df_copy.loc[i,"ocean_proximity"]=="INLAND":
        df_copy.loc[i,"ocean_proximity"]=2
    elif df_copy.loc[i,"ocean_proximity"]=="NEAR OCEAN":
        df_copy.loc[i,"ocean_proximity"]=3
    elif df_copy.loc[i,"ocean_proximity"]=="ISLAND":
        df_copy.loc[i,"ocean_proximity"]=4

    

- Replaced the `ocean_proximity` categories with integer codes:
    - 0 : `NEAR BAY`
    - 1 : `<1H OCEAN`
    - 2 : `INLAND`
    - 3 : `NEAR OCEAN`
    - 4 : `ISLAND`
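
The same category-to-code table can also be applied in one vectorized step with `Series.map` and a dictionary, which avoids the row-by-row loop. This is a sketch of an alternative, not the notebook's own method: the dictionary name `OCEAN_PROXIMITY_CODES` and the small `toy` frame are hypothetical stand-ins for the real column.

```python
import pandas as pd

# Same category-to-code table as in the list above.
OCEAN_PROXIMITY_CODES = {
    "NEAR BAY": 0,
    "<1H OCEAN": 1,
    "INLAND": 2,
    "NEAR OCEAN": 3,
    "ISLAND": 4,
}

# Hypothetical stand-in for the real ocean_proximity column.
toy = pd.DataFrame({
    "ocean_proximity": [
        "NEAR BAY", "INLAND", "<1H OCEAN", "ISLAND", "NEAR OCEAN", "INLAND",
    ]
})

# map() looks every label up in the dictionary in a single pass.
encoded = toy["ocean_proximity"].map(OCEAN_PROXIMITY_CODES)

# Labels absent from the dictionary become NaN, so fail loudly if any exist.
if encoded.isna().any():
    raise ValueError("unmapped ocean_proximity labels found")

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 2, 1, 4, 3, 2]
```

Unlike the loop, an unexpected label surfaces here as an explicit error rather than being left in the column as a string.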

In [None]:
df_copy["ocean_proximity"].unique()

array([0, 1, 2, 3, 4], dtype=object)

In [None]:
# Cast the column from object dtype to int
df_copy["ocean_proximity"] = df_copy["ocean_proximity"].astype(int)
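
The cast is needed because assigning integers one element at a time into a column of strings leaves its dtype as `object`. The sketch below illustrates that behavior on made-up Series (the names `codes` and `leftover` are hypothetical), and shows that the cast also acts as a safety check: a label that was never replaced makes `astype(int)` raise instead of passing through silently.

```python
import pandas as pd

# Made-up column mimicking the state after the row-by-row replacement:
# integer codes stored in an object-dtype column.
codes = pd.Series([0, 1, 2, 3, 4], dtype=object)
print(codes.dtype)  # object

# astype(int) converts the column to a real integer dtype.
as_int = codes.astype(int)
print(as_int.dtype.kind)  # i

# A label that was never replaced is still a string, so the cast raises
# ValueError rather than producing a wrong code.
leftover = pd.Series([0, 1, "UNKNOWN"], dtype=object)
try:
    leftover.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True
```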

In [40]:
df = df_copy.copy()

In [41]:
df.head()

| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 8.3252 | 45.26 | 0 |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 8.3014 | 35.85 | 0 |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 7.2574 | 35.21 | 0 |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 5.6431 | 34.13 | 0 |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 3.8462 | 34.22 | 0 |


In [42]:
df.to_csv("Cat_fixed_Cleaned_Dataset.csv")
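
By default `to_csv` also writes the row index as an unnamed first column, and reading such a file back yields a column called `Unnamed: 0`, like the one in the preview above. The sketch below shows the difference that `index=False` makes. It uses an in-memory buffer and a made-up two-row frame (`toy`) instead of the real files, so it is an illustration and not a change to the notebook's saved output.

```python
import io

import pandas as pd

# Made-up two-row frame standing in for the real data.
toy = pd.DataFrame({"longitude": [-122.23, -122.22], "ocean_proximity": [0, 0]})

# Default behavior: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)
print(list(reloaded_with_index.columns))
# ['Unnamed: 0', 'longitude', 'ocean_proximity']

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
print(list(reloaded_clean.columns))
# ['longitude', 'ocean_proximity']
```

Whether to drop the index depends on what later steps expect; if anything downstream reads `Unnamed: 0`, the default call should stay as it is.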