# California House Price Model

## Project Description
Your task is to predict the average house value of each Californian district from the following district-level data:
- Location (Longitude and Latitude)
- Average Houses' Age
- Total Rooms
- Total Bedrooms
- District Population
- Number of Households
- Average Annual Income
- Average House Value (the prediction target)
- Proximity to Ocean (categorical): One Hour Away from Ocean (1H Ocean), Inland, Near Ocean, Near Bay, Island

## 1. Importing the Basic Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Importing the Dataset

In [None]:
data = pd.read_csv('CaliforniaHouseData.csv')
data

## 3. Exploratory Data Analysis (EDA)
In this step, we will use data visualization methods to explore the main characteristics of the dataset.

### 3.1. Reviewing the data for some general information

In [None]:
data.info()

In [None]:
data['Ocean_proximity'].value_counts()

### 3.2. Reviewing overall statistical information

In [None]:
data.describe()

### 3.3. Getting some insights by plotting different variables

### 3.4. Visualizing the data based on the location

If you look closely, the longitude/latitude scatter plot has the shape of California! We can also set an extra parameter, alpha (transparency), so that denser areas show up in a more intense color. Let's set alpha=0.1.
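A minimal sketch of such a plot follows. The helper name `plot_locations` and the column names `Longitude` and `Latitude` are assumptions for illustration; adjust them to match the columns reported by `data.info()`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_locations(df: pd.DataFrame, x: str = "Longitude", y: str = "Latitude",
                   alpha: float = 0.1):
    """Scatter districts by position; a low alpha makes dense areas look darker."""
    # Column names are assumed; pass the real ones via x and y if they differ.
    ax = df.plot(kind="scatter", x=x, y=y, alpha=alpha, figsize=(8, 6))
    ax.set_title("District locations")
    return ax
```

Calling `plot_locations(data)` in a notebook cell should display the figure inline; with `alpha=1.0` overlapping points become indistinguishable again.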

Now you can clearly see some high-density areas, e.g., the Bay Area, Los Angeles, and San Diego. Next, we can visualize the data to check whether housing price is related to location and population density.

### 3.5. Visualizing the data based on the house value and population
We can draw the same location plot again, but this time each district's population sets the size of its circle (option s) and the color represents the house value (option c). A bigger dot therefore means a higher population, and the color scale shows the range of values.
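One way to draw this is sketched below. The column names (`Longitude`, `Latitude`, `Population`, `House_value`), the divisor of 100, and the `jet` colormap are all assumptions chosen for illustration. Note that in matplotlib, `s` sets the marker area in points squared rather than the radius.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_value_and_population(df, x="Longitude", y="Latitude",
                              size_col="Population", color_col="House_value"):
    """Scatter districts: marker size tracks population, color tracks house value."""
    fig, ax = plt.subplots(figsize=(10, 7))
    points = ax.scatter(
        df[x], df[y],
        s=df[size_col] / 100,  # option s: scaled down so markers stay readable
        c=df[color_col],       # option c: one color value per district
        cmap="jet",
        alpha=0.4,
    )
    fig.colorbar(points, ax=ax, label=color_col)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return ax
```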

We can see that housing price appears to be strongly related to location and population density.

### 3.6. Checking correlations with the house value
- Here, we want to see how house value correlates with the different parameters. We can compute the standard correlation coefficient (Pearson's r) between house value and each of the other parameters.
- The correlation coefficient ranges from -1 to +1. A value close to +1 indicates a strong positive correlation (for example, the house value goes up as the parameter increases); a value close to -1 indicates a strong negative correlation (for example, the house value goes down as the parameter increases).
- Note that a coefficient of exactly 1 appears where a parameter is compared with itself.
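The computation described above can be sketched with pandas as follows. The target column name `House_value` and the helper name are assumptions; text columns such as `Ocean_proximity` are dropped first because Pearson's r is defined only for numeric data.

```python
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str = "House_value") -> pd.Series:
    """Pearson's r between the target and every numeric column, strongest first."""
    numeric = df.select_dtypes(include="number")  # excludes text columns
    corr_matrix = numeric.corr()                  # Pearson is the pandas default
    return corr_matrix[target].sort_values(ascending=False)
```

The first entry of the result is the target itself with r = 1, which is the self-comparison mentioned in the last bullet.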

### 3.7. Creating new features and checking their correlations with the house value

Rooms_per_household and bedrooms_per_room correlate more strongly with the house value than population_per_household does.
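The three features named above could be derived as simple ratios, as in the sketch below. The ratio definitions are inferred from the feature names, and the source column names (`Total_rooms`, `Total_bedrooms`, `Population`, `Households`) are assumptions.

```python
import pandas as pd


def add_ratio_features(df, rooms="Total_rooms", bedrooms="Total_bedrooms",
                       population="Population", households="Households"):
    """Return a copy of df with three per-district ratio features added."""
    out = df.copy()
    out["Rooms_per_household"] = out[rooms] / out[households]
    out["bedrooms_per_room"] = out[bedrooms] / out[rooms]
    out["population_per_household"] = out[population] / out[households]
    return out
```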

## 4. Preprocessing the Data for Machine Learning

In [None]:
data.info()

### 4.1. Rearranging the sequence of the data
- Put all numerical data in the first 10 columns.
- Put the house value in the second-to-last column.
- Put the categorical data (ocean proximity) in the last column.
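A small sketch of this reordering is shown below. The target name `House_value` is an assumption, and the helper simply keeps the remaining columns in their existing order.

```python
import pandas as pd


def reorder_columns(df, target="House_value", categorical="Ocean_proximity"):
    """Numeric features first, then the target, then the categorical column."""
    features = [c for c in df.columns if c not in (target, categorical)]
    return df[features + [target, categorical]]
```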

In [None]:
data.info()

We have some missing data.

### 4.2. Scaling the numerical data 
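The notebook does not say which scaling method is used. As one possibility, the sketch below standardizes columns to zero mean and unit standard deviation using plain pandas; the helper name is hypothetical.

```python
import pandas as pd


def standardize(df, columns):
    """Return a copy with the given columns rescaled to mean 0 and std 1."""
    out = df.copy()
    cols = list(columns)
    # pandas skips NaN when computing mean/std, so missing cells stay missing.
    out[cols] = (out[cols] - out[cols].mean()) / out[cols].std(ddof=0)
    return out
```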

### 4.3. Separating the input data from the output data
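A minimal sketch of this step, assuming the output (target) column is named `House_value`:

```python
import pandas as pd


def split_features_target(df, target="House_value"):
    """Return (X, y): every column except the target, and the target alone."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y
```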

### 4.4. Imputing the missing numerical data
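One common choice, assumed here because the notebook does not name a strategy, is to fill each missing numeric cell with its column median. The helper name is hypothetical.

```python
import pandas as pd


def impute_median(df):
    """Fill missing numeric cells with the median of their column."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out
```

Running `data.info()` again afterwards is a quick way to confirm that the non-null counts of the numeric columns now match the number of rows.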

### 4.5. Taking care of the outliers in the numerical data
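The notebook does not state how outliers are handled. The sketch below shows one widely used option, clipping values that fall more than 1.5 interquartile ranges outside the quartiles; both the method and the factor `k` are assumptions.

```python
import pandas as pd


def clip_outliers_iqr(df, columns, k=1.5):
    """Clip each listed column to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out
```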

### 4.6. Encoding the categorical data
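A regression model needs numbers rather than category labels. As a sketch, and assuming one-hot encoding is the intended method, `pd.get_dummies` replaces `Ocean_proximity` with one indicator column per category while leaving the numeric columns in place.

```python
import pandas as pd


def one_hot_encode(df, column="Ocean_proximity"):
    """Replace a categorical column with one 0/1 indicator column per category."""
    return pd.get_dummies(df, columns=[column], dtype=float)
```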

##### Combining the numerical and categorical training data

##### A quick check on the preprocessed training data

### 4.7. Splitting the dataset into the Training Set and Test Set
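A sketch of the split using scikit-learn's `train_test_split` follows. The toy arrays stand in for the preprocessed inputs and outputs, and the 80/20 ratio and `random_state` value are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; fixing the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```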

## 5. Training the Linear Regression Model with the Training Set
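A minimal sketch of fitting a linear regression with scikit-learn is shown below. The toy arrays are placeholders for the preprocessed training set, constructed so that the true relationship is known.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for the training set, generated from y = 3*x0 + 2*x1 + 1.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_train = 3 * X_train[:, 0] + 2 * X_train[:, 1] + 1

model = LinearRegression()
model.fit(X_train, y_train)  # learns one coefficient per feature plus an intercept
```

After fitting, `model.coef_` and `model.intercept_` hold the learned weights, which here should recover the coefficients used to generate the toy data.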

## 6. Checking the Trained Model with the Test Set
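The notebook does not specify which metrics are used. As a sketch, the root mean squared error (RMSE) and the R² score are two common checks for a regression model; the synthetic data below is a placeholder for the real training and test sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic placeholder data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions on data the model has not seen

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # same units as the target
r2 = r2_score(y_test, y_pred)                       # 1.0 would be a perfect fit
```

RMSE is expressed in the units of the target, so on the real dataset it can be read directly as a typical prediction error in house value.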