<a href="https://colab.research.google.com/github/armandossrecife/mymachinelearning/blob/main/my_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# California Housing Dataset

The `fetch_california_housing` function in scikit-learn loads a dataset of California housing data derived from the 1990 U.S. census, with one row per census block group. It's a popular dataset for regression tasks such as house price prediction. Here's a breakdown of the data:

**Features:**

The dataset includes eight numeric features (columns), each describing a census block group. The name in parentheses is the column name returned in `feature_names`:

1. **Median Income (`MedInc`):** Median household income in the block group.
2. **House Age (`HouseAge`):** Median age of the houses in the block group.
3. **Average Rooms (`AveRooms`):** Average number of rooms per household.
4. **Average Bedrooms (`AveBedrms`):** Average number of bedrooms per household.
5. **Population (`Population`):** Number of people living in the block group.
6. **Average Occupancy (`AveOccup`):** Average number of household members.
7. **Latitude (`Latitude`):** Latitude of the block group.
8. **Longitude (`Longitude`):** Longitude of the block group.

**Target Variable:**

The dataset provides a single target variable, the median house value (`MedHouseVal` in `target_names`), which is the median value of homes in a block group, expressed in units of $100,000. This is the value you'll try to predict using regression models.

**Data Format:**

The `fetch_california_housing` function returns a `Bunch`, a dictionary-like object whose entries can also be read as attributes. Its two main entries are:

1. **data:** A NumPy array of shape (20640, 8), i.e. (number of samples, number of features), containing the feature values for each block group.
2. **target:** A NumPy array of shape (20640,) containing the target variable (median house value) for each block group.

It also carries `feature_names`, `target_names`, and `DESCR` (a text description of the dataset).
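
The object returned by `fetch_california_housing` is a scikit-learn `Bunch`, which supports both attribute access and key access. The sketch below builds a small stand-in `Bunch` by hand (two rows copied from the dataset) so it runs without downloading anything; the variable name `housing` is only illustrative.

```python
import numpy as np
from sklearn.utils import Bunch

# Hand-built stand-in for the object returned by fetch_california_housing().
# Only two rows are included here; the real object holds 20,640 rows and
# also carries target_names and DESCR.
housing = Bunch(
    data=np.array([
        [8.3252, 41.0, 6.984127, 1.02381, 322.0, 2.555556, 37.88, -122.23],
        [8.3014, 21.0, 6.238137, 0.97188, 2401.0, 2.109842, 37.86, -122.22],
    ]),
    target=np.array([4.526, 3.585]),
    feature_names=["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                   "Population", "AveOccup", "Latitude", "Longitude"],
)

# Attribute access and key access return the same array object.
assert housing.data is housing["data"]

print(housing.data.shape)    # (n_samples, n_features)
print(housing.target.shape)  # (n_samples,)
print(sorted(housing.keys()))
```

With the real loader, `fetch_california_housing(return_X_y=True)` returns just the `(data, target)` pair, and `as_frame=True` returns the data as pandas objects instead of NumPy arrays.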

**Example Usage:**

```python
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
data = fetch_california_housing()

# Access features (X)
X = data.data

# Access target variable (y)
y = data.target

# Print feature names
print(data.feature_names)
```

This code snippet demonstrates how to load the dataset, access features and target variable, and print the feature names for better understanding.

By understanding the features and target variable in the California Housing Prices dataset, you can use it for various machine learning tasks, particularly those related to house price prediction and analysis of factors influencing housing prices.

In [16]:
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
data = fetch_california_housing()

# Access features (X)
X = data.data

# Access target variable (y)
y = data.target

# Print feature names
print(data.feature_names)

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [17]:
import pandas as pd

# Load the California housing dataset
data = fetch_california_housing()

# Create a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MedianHouseValue'] = data.target  # Add target variable as a column

In [18]:
# Print the DataFrame (optional)
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
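
The target values in the last column are stored in units of $100,000, so 4.526 corresponds to roughly $452,600. As a minimal sketch, the five rows shown above can be rebuilt into a DataFrame and given a dollar-valued column; the column name `MedianHouseValueUSD` is a hypothetical choice for this illustration, not something the dataset provides.

```python
import pandas as pd

columns = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
           "AveOccup", "Latitude", "Longitude", "MedianHouseValue"]

# The five rows displayed by df.head(), copied here so the sketch is self-contained.
rows = [
    [8.3252, 41.0, 6.984127, 1.02381, 322.0, 2.555556, 37.88, -122.23, 4.526],
    [8.3014, 21.0, 6.238137, 0.97188, 2401.0, 2.109842, 37.86, -122.22, 3.585],
    [7.2574, 52.0, 8.288136, 1.073446, 496.0, 2.80226, 37.85, -122.24, 3.521],
    [5.6431, 52.0, 5.817352, 1.073059, 558.0, 2.547945, 37.85, -122.25, 3.413],
    [3.8462, 52.0, 6.281853, 1.081081, 565.0, 2.181467, 37.85, -122.25, 3.422],
]
df = pd.DataFrame(rows, columns=columns)

# Convert the target from units of $100,000 to plain dollars.
# "MedianHouseValueUSD" is a hypothetical column name used for illustration.
df["MedianHouseValueUSD"] = df["MedianHouseValue"] * 100_000

print(df[["MedInc", "HouseAge", "MedianHouseValue", "MedianHouseValueUSD"]])

# Quick numeric summary of two columns.
print(df.describe().loc[["mean", "min", "max"], ["MedInc", "MedianHouseValue"]])
```

On the full DataFrame built in the cell above, the same two statements apply unchanged.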


The `fetch_california_housing` dataset from scikit-learn offers various possibilities for data visualization using libraries like matplotlib or seaborn. Here are some informative graphics you can generate:

**1. Distribution of Features:**

* **Histograms:** Create histograms to visualize the distribution of each feature (e.g., median income, house age) to understand how the data is spread out.
* **Density Plots:** Overlay density plots on histograms for a smoother representation of the feature distributions.

**2. Feature Relationships:**

* **Scatter Plots:** Create scatter plots to explore the relationships between pairs of features (e.g., median income vs. median house value). This can reveal potential correlations or patterns.
* **Pairplots:** Use seaborn's `pairplot` function to create a matrix of scatter plots, visualizing all pairwise relationships between features simultaneously.

**3. Target Variable Analysis:**

* **Boxplots:** Create boxplots to visualize the distribution of median house value across categories derived from another feature. Every feature in this dataset is numeric, so bin one first (e.g., boxplots of median house value by house-age bracket or income quantile). This can reveal how such factors might influence house prices.
* **Heatmaps:** Generate heatmaps using seaborn to represent the correlation matrix between all features and the target variable (median house value). This helps identify features that are highly correlated with house prices.

**4. Model Evaluation (if applicable):**

* **Scatter Plots:** After training a regression model, use scatter plots to compare the predicted median house values with the actual values. This helps visualize the model's performance.
* **Residual Plots:** Create residual plots to analyze the distribution of errors (differences between predicted and actual values). This can provide insights into potential model biases or areas for improvement.

**Additional Tips:**

* Consider color-coding data points in scatter plots or boxplots based on another feature for further insights.
* Explore interactive visualization libraries like Plotly for creating more dynamic and user-friendly visualizations.

By creating these different types of graphics, you can gain valuable insights from the `fetch_california_housing` dataset. You can understand the distribution of features, relationships between features and the target variable, and potentially evaluate the performance of machine learning models for house price prediction.
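
The correlation matrix behind the heatmap idea above can be computed with pandas alone. The sketch below uses only the five rows shown by `df.head()`, so that it runs without a download; five rows are far too few for real conclusions, and the numbers it prints only illustrate the mechanics.

```python
import pandas as pd

columns = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
           "AveOccup", "Latitude", "Longitude", "MedianHouseValue"]

# Five sample rows copied from the df.head() output; on real work, use the full df.
rows = [
    [8.3252, 41.0, 6.984127, 1.02381, 322.0, 2.555556, 37.88, -122.23, 4.526],
    [8.3014, 21.0, 6.238137, 0.97188, 2401.0, 2.109842, 37.86, -122.22, 3.585],
    [7.2574, 52.0, 8.288136, 1.073446, 496.0, 2.80226, 37.85, -122.24, 3.521],
    [5.6431, 52.0, 5.817352, 1.073059, 558.0, 2.547945, 37.85, -122.25, 3.413],
    [3.8462, 52.0, 6.281853, 1.081081, 565.0, 2.181467, 37.85, -122.25, 3.422],
]
df = pd.DataFrame(rows, columns=columns)

# Pearson correlation between every pair of columns (features and target).
corr = df.corr()

# Correlation of each feature with the target, ordered by absolute strength.
target_corr = (
    corr["MedianHouseValue"]
    .drop("MedianHouseValue")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Passing `corr` to seaborn's `heatmap` function (for example with `annot=True`) would then draw the heatmap described in the list.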

Here's an example of code that implements a regression model using scikit-learn with the California Housing Prices dataset from `fetch_california_housing`:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
data = fetch_california_housing()

# Separate features (X) and target variable (y)
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model performance using mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
```

This code performs the following steps:

1. **Import libraries:**
   - `fetch_california_housing` to load the dataset.
   - `LinearRegression` from `sklearn.linear_model` for the regression model.
   - `train_test_split` from `sklearn.model_selection` for splitting data.
   - `mean_squared_error` from `sklearn.metrics` for evaluation.

2. **Load the dataset:**
   - Uses `fetch_california_housing` to retrieve the data.

3. **Separate features and target:**
   - Assigns features (`data.data`) to `X` and target variable (`data.target`) to `y`.

4. **Split data:**
   - Splits `X` and `y` into training and testing sets using `train_test_split`. Here, 20% of the data is used for testing (adjustable with `test_size`). A random seed (`random_state`) ensures reproducibility.

5. **Create the model:**
   - Initializes a `LinearRegression` object for linear regression.

6. **Train the model:**
   - Uses `model.fit` to train the model on the training data (`X_train`, `y_train`).

7. **Make predictions:**
   - Uses `model.predict` to predict housing prices on the testing data (`X_test`).

8. **Evaluate the model:**
   - Calculates the mean squared error (MSE) between the actual and predicted values using `mean_squared_error`. A lower MSE indicates a better fit.

**Note:**

- This example uses a linear regression model. You can explore other regression models available in scikit-learn like Ridge Regression or Lasso Regression.
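- Feature scaling does not change the predictions of ordinary least squares, but it matters for regularized models such as Ridge and Lasso; consider scaling or feature selection techniques when you switch to them.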
- This is a basic example. You can use more advanced evaluation metrics and techniques for a comprehensive analysis.
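
As a hedged sketch of the first two notes, the example below swaps in Ridge regression and puts feature scaling in a pipeline, then reports R² alongside MSE. To keep it runnable without a download, it uses synthetic data from `make_regression` with eight features as a stand-in for the housing data; the value `alpha=1.0` is an arbitrary illustrative choice, not a tuned setting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features so the sketch runs offline.
# For the real data, replace this line with:
#   X, y = fetch_california_housing(return_X_y=True)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on the training split only when the pipeline is fit,
# which avoids leaking test-set statistics into the model.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R^2: {r2:.3f}")
```

The scores printed here describe the synthetic data only and say nothing about how Ridge performs on the housing dataset.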


In [19]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
data = fetch_california_housing()

# Separate features (X) and target variable (y)
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model performance using mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

Mean Squared Error: 0.56


In [20]:
# Assuming you already have X_test and y_test from your previous code

# Select a few samples from the testing data (adjust the index range)
sample_X_test = X_test[10:20]  # Select rows from index 10 to 19 (inclusive)
sample_y_test = y_test[10:20]  # Select corresponding target values


In [21]:
# Make predictions on the sample data
sample_y_pred = model.predict(sample_X_test)

# Print the sample data and predictions
print("Sample True Values (y_test):", sample_y_test)
print("Sample Predicted Values (y_pred):", sample_y_pred)


Sample True Values (y_test): [1.232 2.539 2.151 2.205 2.198 1.362 1.784 1.875 1.398 1.375]
Sample Predicted Values (y_pred): [0.93896156 1.90122177 1.75871178 2.2501598  2.54086976 1.9174049
 2.38648295 2.01093032 2.22740934 1.11853152]
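
To put the sample predictions above into more interpretable terms, the sketch below computes the residuals, mean absolute error (MAE), and root mean squared error (RMSE) for those ten values, and converts the errors to dollars using the target's unit of $100,000. The arrays are copied from the printed output; the variable names are illustrative.

```python
import numpy as np

# True and predicted values copied from the sample output above.
y_true = np.array([1.232, 2.539, 2.151, 2.205, 2.198,
                   1.362, 1.784, 1.875, 1.398, 1.375])
y_hat = np.array([0.93896156, 1.90122177, 1.75871178, 2.2501598, 2.54086976,
                  1.9174049, 2.38648295, 2.01093032, 2.22740934, 1.11853152])

# Residual = actual minus predicted; positive means the model under-predicted.
residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))
rmse = np.sqrt(np.mean(residuals ** 2))

# The target is in units of $100,000, so multiply to express errors in dollars.
print(f"MAE:  {mae:.3f} (about ${mae * 100_000:,.0f})")
print(f"RMSE: {rmse:.3f} (about ${rmse * 100_000:,.0f})")
print("Residuals:", np.round(residuals, 3))
```

Ten samples give only a rough impression; the same calculation on the full `y_test` and `y_pred` arrays is the more meaningful check.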
