# Generating and Analyzing Home Sales Data

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate home sales data 100 records

This notebook generates 100 synthetic home sales records and analyzes them. It defines the data-generation functions, generates the records, cleans and prepares them, and then explores the result with summary statistics and visualizations. Generation can be customized through parameters such as the number of records, the price range, and the location.

## Define Functions

In [None]:
import random

def generate_sales_data(num_records, price_range, location):
    # price_range is a (min, max) pair; random.randint includes both endpoints
    low, high = price_range
    # Every record shares the same location and gets its own random price
    return [{'price': random.randint(low, high), 'location': location}
            for _ in range(num_records)]
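
As a quick illustration of how the function above might be called, here is a minimal sketch. The seed, the price range, and the `'Seattle'` location are illustrative assumptions, and the function is repeated in compact form so the cell runs on its own.

```python
import random

def generate_sales_data(num_records, price_range, location):
    # Same logic as the function above, repeated so this cell is self-contained
    low, high = price_range
    return [{'price': random.randint(low, high), 'location': location}
            for _ in range(num_records)]

random.seed(0)  # assumed seed, only to make the example reproducible
records = generate_sales_data(5, (200000, 400000), 'Seattle')

for record in records:
    print(record)
```

Each record is a plain dictionary, so the list can be passed directly to `pd.DataFrame(records)` if a tabular view is needed.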

## Generate Data

In [None]:
import numpy as np
import pandas as pd

In [None]:
def generate_home_sales_data(num_records):
    columns = ['Address', 'City', 'State', 'Zipcode', 'Price', 'Bedrooms', 'Bathrooms', 'Square Footage', 'Year Built']
    # Draw city and state as a pair so every record gets a matching state
    locations = [('Los Angeles', 'CA'), ('San Francisco', 'CA'), ('New York', 'NY'), ('Chicago', 'IL'), ('Seattle', 'WA')]
    rows = []
    for _ in range(num_records):
        # np.random.randint excludes the upper bound, e.g. randint(1, 6) yields 1 to 5
        address = f"{np.random.randint(1, 10000)} Main St"
        city, state = locations[np.random.randint(len(locations))]
        zipcode = f"{np.random.randint(10000, 99999)}"
        price = np.random.randint(100000, 1000000)
        bedrooms = np.random.randint(1, 6)
        bathrooms = np.random.randint(1, 4)
        square_footage = np.random.randint(1000, 5000)
        year_built = np.random.randint(1900, 2021)
        rows.append([address, city, state, zipcode, price, bedrooms, bathrooms, square_footage, year_built])
    # Build the DataFrame once instead of appending row by row with .loc
    return pd.DataFrame(rows, columns=columns)

In [None]:
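home_sales_data = generate_home_sales_data(100)
# Save the records so the cleaning and analysis cells can load them from disk
home_sales_data.to_csv('home_sales_data.csv', index=False)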
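
The generator above fills the table one record at a time. As an alternative, the same columns can be drawn in bulk. The sketch below is an assumption-laden variant, not the notebook's own method: the name `generate_home_sales_data_vectorized`, the `seed` parameter, and the `LOCATIONS` list of city and state pairs are all hypothetical, and it uses NumPy's `default_rng` generator rather than the `np.random.randint` calls used above.

```python
import numpy as np
import pandas as pd

# Hypothetical (city, state) pairs so the two columns always agree
LOCATIONS = [('Los Angeles', 'CA'), ('San Francisco', 'CA'), ('New York', 'NY'),
             ('Chicago', 'IL'), ('Seattle', 'WA')]

def generate_home_sales_data_vectorized(num_records, seed=None):
    """Hypothetical bulk variant of generate_home_sales_data."""
    rng = np.random.default_rng(seed)
    # One location index per record; rng.integers excludes the upper bound
    picks = rng.integers(0, len(LOCATIONS), size=num_records)
    return pd.DataFrame({
        'Address': [f"{n} Main St" for n in rng.integers(1, 10000, size=num_records)],
        'City': [LOCATIONS[i][0] for i in picks],
        'State': [LOCATIONS[i][1] for i in picks],
        'Zipcode': rng.integers(10000, 99999, size=num_records).astype(str),
        'Price': rng.integers(100000, 1000000, size=num_records),
        'Bedrooms': rng.integers(1, 6, size=num_records),
        'Bathrooms': rng.integers(1, 4, size=num_records),
        'Square Footage': rng.integers(1000, 5000, size=num_records),
        'Year Built': rng.integers(1900, 2021, size=num_records),
    })

sample = generate_home_sales_data_vectorized(100, seed=42)
print(sample.shape)  # (100, 9)
```

Passing a fixed `seed` makes the output repeatable between runs, which can help when comparing plots or statistics.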

## Data Cleaning and Preparation

In [None]:
import pandas as pd

def clean_data():
    # Load the generated records from disk
    data = pd.read_csv('home_sales_data.csv')
    # Remove exact duplicate rows and replace missing values with 0
    data.drop_duplicates(inplace=True)
    data.fillna(value=0, inplace=True)
    # Column names must match the capitalized names used by the generator
    data['Price'] = data['Price'].astype(int)
    data['Bedrooms'] = data['Bedrooms'].astype(int)
    data['Bathrooms'] = data['Bathrooms'].astype(int)
    data.to_csv('cleaned_home_sales_data.csv', index=False)
    return 'Data cleaning and preparation complete.'
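
To show what these cleaning steps do to actual values, here is a minimal in-memory sketch. The helper `clean_frame` and the three-row `raw` table are hypothetical and invented for illustration; the helper applies the same steps (drop duplicates, fill missing values with 0, cast to integer) without reading or writing files.

```python
import numpy as np
import pandas as pd

def clean_frame(df):
    """Hypothetical in-memory counterpart of clean_data()."""
    # drop_duplicates and fillna both return new DataFrames, so df is untouched
    cleaned = df.drop_duplicates().fillna(0)
    for column in ['Price', 'Bedrooms', 'Bathrooms']:
        cleaned[column] = cleaned[column].astype(int)
    return cleaned.reset_index(drop=True)

# Invented sample: rows 0 and 1 are duplicates, row 2 has missing values
raw = pd.DataFrame({
    'Price': [250000.0, 250000.0, np.nan],
    'Bedrooms': [3, 3, 2],
    'Bathrooms': [2.0, 2.0, np.nan],
})

cleaned = clean_frame(raw)
print(cleaned)
```

Note that filling a missing price with 0 keeps the row but pulls averages down, so dropping such rows may be preferable for some analyses.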

## Data Analysis

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load the generated home sales data
data = pd.read_csv('home_sales_data.csv')

In [None]:
# Display the first 5 rows of the data
data.head()

In [None]:
# Calculate summary statistics for the data
data.describe()
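
Beyond `describe()`, grouped statistics and derived columns are a common next step. The following is a small sketch on an invented four-row table; the `sample` values and the `'Price per Sq Ft'` column name are assumptions for illustration, while the other column names follow the generator above.

```python
import pandas as pd

# Invented records using the generator's column names
sample = pd.DataFrame({
    'Price': [300000, 450000, 500000, 700000],
    'Bedrooms': [2, 2, 3, 3],
    'Square Footage': [1000, 1500, 2000, 2500],
})

# Derived column: price divided by floor area
sample['Price per Sq Ft'] = sample['Price'] / sample['Square Footage']

# Median price for each bedroom count
median_by_bedrooms = sample.groupby('Bedrooms')['Price'].median()
print(median_by_bedrooms)
```

The same two lines can be applied to `data` once it has been loaded.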

In [None]:
# Create a histogram of the sale prices
plt.hist(data['Price'], bins=20)
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.title('Distribution of Sale Prices')
plt.show()

In [None]:
# Create a scatter plot of the sale price vs. the square footage
plt.scatter(data['Square Footage'], data['Price'])
plt.xlabel('Square Footage')
plt.ylabel('Sale Price')
plt.title('Sale Price vs. Square Footage')
plt.show()
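
Because the generator draws price and square footage independently of each other, the scatter plot should show no trend. A minimal sketch of that check follows; it assumes NumPy's `default_rng` with an arbitrary seed and a larger sample of 10,000 draws, using the same value ranges as the generator.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed for reproducibility
n = 10_000

# Independent draws over the same ranges the generator uses
square_footage = rng.integers(1000, 5000, size=n)
price = rng.integers(100000, 1000000, size=n)

# Pearson correlation between the two columns
correlation = np.corrcoef(square_footage, price)[0, 1]
print(f"Pearson correlation: {correlation:.3f}")
```

A value close to zero confirms that any apparent pattern in the plot is noise rather than a relationship built into the data.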

In [None]:
# Create a box plot of the sale price by the number of bedrooms
data.boxplot(column='Price', by='Bedrooms')
plt.suptitle('')  # Remove the automatic "Boxplot grouped by Bedrooms" heading
plt.xlabel('Number of Bedrooms')
plt.ylabel('Sale Price')
plt.title('Sale Price by Number of Bedrooms')
plt.show()