# EDA

The `os` module has a convenient function for listing the files in a directory (`os.listdir`).
`pandas.json_normalize` could work here, but it is not necessary for converting the JSON data to a dataframe.
You may need a nested for-loop to access each sale!
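
A minimal sketch of the nested loop described above, assuming every file follows the `data` -> `results` layout seen in the sample files. The helper name `load_listings` is hypothetical and is not part of `functions_variables.py`.

```python
import json
import os


def load_listings(data_dir):
    """Collect every listing dict from every .json file in data_dir."""
    records = []
    # Outer loop: one iteration per file in the directory
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(data_dir, name), "r") as f:
            raw = json.load(f)
        results = raw.get("data", {}).get("results", [])
        # Files with no sales hold an empty dict here instead of a list
        if not isinstance(results, list):
            continue
        # Inner loop: one iteration per sale in the file
        for listing in results:
            records.append(listing)
    return records
```

The returned list of dicts can be passed to `pd.DataFrame` directly, or to `pd.json_normalize` if you want the nested keys flattened into columns.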
We've put a lot of time into creating the structure of this repository, and it's a good example for future projects. In the file `functions_variables.py`, there is an example function that you can import and use. Any variables, functions or classes that you want to create can be put in `functions_variables.py` and imported into a notebook. Note that only `.py` files can be imported into a notebook.

To import everything from a `.py` file, use `from functions_variables import *`. If you instead run `import functions_variables`, each object from the file must be prefixed with `functions_variables.` (for example, `functions_variables.encode_tags`).

Using this `.py` file will keep your notebooks organized and make it easier to reuse code between notebooks.

In [89]:
# (this is not an exhaustive list of libraries)
import pandas as pd
import numpy as np
import os
import json
from pprint import pprint
from functions_variables import encode_tags

## Data Importing

In [90]:

# Define the data directory and file path
data_dir = "/Users/erum/LHL-midterm/data"
sample_file = os.path.join(data_dir, "AK_Juneau_0.json")

# Load JSON data
with open(sample_file, "r") as f:
    data = json.load(f)

# Convert JSON to a DataFrame
df = pd.DataFrame(data)

# Save to CSV only if the DataFrame is not empty
if not df.empty:
    df.to_csv("real_estate_listings_augusta.csv", index=False)
    print("CSV file created successfully!")
else:
    print("Warning: The DataFrame is empty. Check the JSON structure.")


CSV file created successfully!


In [80]:
# load one file first to see what type of data you're dealing with and what attributes it has

# Define the data directory
data_dir = "/Users/erum/LHL-midterm/data"

# Inspect one file
sample_file = os.path.join(data_dir, "ME_Augusta_4.json")

with open(sample_file, "r") as f:
    data = json.load(f)

# Print a preview of the data
print(json.dumps(data, indent=4))

# Save to CSV
# Note: `df` comes from an earlier cell; it was not built from the file previewed above
df.to_csv("real_estate_listings_augusta.csv", index=False)
print("CSV file created successfully!")


{
    "status": 200,
    "data": {
        "total": 0,
        "count": 0,
        "results": {}
    }
}
CSV file created successfully!


In [74]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   primary_photo     8159 non-null   object
 1   last_update_date  8125 non-null   object
 2   source_type       8159 non-null   object
 3   agent_offices     8159 non-null   object
dtypes: object(4)
memory usage: 255.1+ KB
None


In [91]:
print(df.describe(include="all"))

        status  data
count      3.0   3.0
unique     NaN   2.0
top        NaN   8.0
freq       NaN   2.0
mean     200.0   NaN
std        0.0   NaN
min      200.0   NaN
25%      200.0   NaN
50%      200.0   NaN
75%      200.0   NaN
max      200.0   NaN


In [None]:
# Initialize a list to store extracted sale records
data_list = []

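# List every JSON file in the data directory
json_files = [f for f in os.listdir(data_dir) if f.endswith(".json")]

# Loop through each JSON file
for file in json_files: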
    file_path = os.path.join(data_dir, file)  # Construct full file path
    
    with open(file_path, "r") as f:
        raw_data = json.load(f)  # Load JSON data
        
        # Extract property listings (assuming structure is in "data" -> "results")
        listings = raw_data.get("data", {}).get("results", [])

        if not isinstance(listings, list):  # Check if "results" is not a list
            print(f"Skipping malformed file: {file}")
            continue  # Skip files without proper listings

        # Process each listing
        for listing in listings:
            sale_record = {
                "property_id": listing.get("property_id", "Unknown"),
                "permalink": listing.get("permalink", "Unknown"),
                "status": listing.get("status", "Unknown"),
                "year_built": listing.get("description", {}).get("year_built", "Unknown"),
                "garage": listing.get("description", {}).get("garage", "Unknown"),
                "stories": listing.get("description", {}).get("stories", "Unknown"),
                "beds": listing.get("description", {}).get("beds", "Unknown"),
                "baths": listing.get("description", {}).get("baths", "Unknown"),
                "type": listing.get("description", {}).get("type", "Unknown"),
                "lot_sqft": listing.get("description", {}).get("lot_sqft", "Unknown"),
                "sqft": listing.get("description", {}).get("sqft", "Unknown"),
                "sold_price": listing.get("description", {}).get("sold_price", "Unknown"),
                "sold_date": listing.get("description", {}).get("sold_date", "Unknown"),
                "list_price": listing.get("list_price", "Unknown"),
                "last_update_date": listing.get("last_update_date", "Unknown"),
                "city": listing.get("location", {}).get("address", {}).get("city", "Unknown"),
                "state": listing.get("location", {}).get("address", {}).get("state", "Unknown"),
                "postal_code": listing.get("location", {}).get("address", {}).get("postal_code", "Unknown"),
                "street_view_url": listing.get("location", {}).get("street_view_url", "Unknown"),
                "tags": ", ".join(listing.get("tags", [])) if isinstance(listing.get("tags"), list) else "Unknown"  # Extract tags as a comma-separated string
            }
            data_list.append(sale_record)

# Convert extracted data into a DataFrame
df = pd.DataFrame(data_list)

# Print DataFrame summary to verify structure
print(df.info())

# Save to CSV for easier analysis
df.to_csv("expanded_real_estate_listings.csv", index=False)
print("CSV file created successfully!")





Skipping malformed file: ME_Augusta_4.json
Skipping malformed file: MS_Jackson_0.json
Skipping malformed file: MS_Jackson_1.json
Skipping malformed file: WY_Cheyenne_4.json
Skipping malformed file: VT_Montpelier_4.json
Skipping malformed file: WY_Cheyenne_3.json
Skipping malformed file: SD_Pierre_0.json
Skipping malformed file: ME_Augusta_2.json
Skipping malformed file: VT_Montpelier_3.json
Skipping malformed file: ME_Augusta_3.json
Skipping malformed file: VT_Montpelier_2.json
Skipping malformed file: SD_Pierre_1.json
Skipping malformed file: WY_Cheyenne_2.json
Skipping malformed file: SD_Pierre_2.json
Skipping malformed file: MS_Jackson_4.json
Skipping malformed file: NH_Concord_4.json
Skipping malformed file: WY_Cheyenne_1.json
Skipping malformed file: VT_Montpelier_1.json
Skipping malformed file: ME_Augusta_0.json
Skipping malformed file: ND_Bismarck_2.json
Skipping malformed file: HI_Honolulu_3.json
Skipping malformed file: ND_Bismarck_3.json
Skipping malformed file: VT_Montpelier

In [132]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   property_id       8159 non-null   object 
 1   permalink         8159 non-null   object 
 2   status            8159 non-null   object 
 3   year_built        7316 non-null   float64
 4   garage            4448 non-null   float64
 5   stories           6260 non-null   float64
 6   beds              7504 non-null   float64
 7   baths             7980 non-null   float64
 8   type              8125 non-null   object 
 9   lot_sqft          6991 non-null   float64
 10  sqft              7323 non-null   float64
 11  sold_price        6716 non-null   float64
 12  sold_date         8159 non-null   object 
 13  list_price        7721 non-null   float64
 14  last_update_date  8125 non-null   object 
 15  city              8154 non-null   object 
 16  state             8159 non-null   object 


In [114]:
print(df.describe(include="all"))

       property_id                                          permalink status  \
count         8159                                               8159   8159   
unique        1795                                               1795      1   
top     6007726455  12312-Birchfalls-Dr_Raleigh_NC_27614_M60077-26455   sold   
freq             5                                                  5   8159   
mean           NaN                                                NaN    NaN   
std            NaN                                                NaN    NaN   
min            NaN                                                NaN    NaN   
25%            NaN                                                NaN    NaN   
50%            NaN                                                NaN    NaN   
75%            NaN                                                NaN    NaN   
max            NaN                                                NaN    NaN   

         year_built       garage      s

## Data Cleaning and Wrangling

At this point, ensure that you have all sales in a dataframe.
- Take a quick look at your data (e.g. `.info()`, `.describe()`). What do you see?
- Is each cell one value, or do some cells have lists?
- What are the data types of each column?
- Some sales may not actually include the sale price (the target). These rows should be dropped.
- There are a lot of NA/None values. Should these be dropped or replaced with something?
    - You can drop rows or use various methods to fill NAs. Use your best judgement for each column.
    - For some columns (like `garage`), NA probably just means no garage, so fill with 0.
- Drop columns that aren't needed
    - Don't keep the list price because it will be too close to the sale price. Assume we want to predict the price of houses not yet listed
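
The checklist above can be sketched as a small helper. This is one possible approach, assuming the column names produced by the extraction cell (`sold_price`, `garage`, `list_price`); the function name `basic_clean` is hypothetical.

```python
import pandas as pd


def basic_clean(df):
    """Drop rows without a target, fill garage NAs with 0, drop list_price."""
    out = df.copy()
    # Rows with no sale price cannot be used for training or evaluation
    out = out.dropna(subset=["sold_price"])
    # Assumption: a missing garage count means the home has no garage
    out["garage"] = out["garage"].fillna(0)
    # list_price is too close to the target, so remove it if present
    out = out.drop(columns=["list_price"], errors="ignore")
    return out.reset_index(drop=True)
```

Other columns will need their own decision. Filling with 0 is only sensible where "missing" plausibly means "none".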

In [None]:
# load and concatenate data here
# drop or replace values as necessary

### Dealing with Tags

Consider the fact that with tags, there are a lot of categorical variables.
- How many columns would we have if we one-hot encode (OHE) tags, city and state?
- Perhaps we can get rid of tags that have a low frequency.
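
One way to encode tags manually while dropping rare ones is sketched below. It assumes each cell of `df["tags"]` is a list of strings (anything else is treated as "no tags"). The name `encode_frequent_tags` and the `tag_` column prefix are hypothetical, and this is not the `encode_tags` function from `functions_variables.py`.

```python
import pandas as pd


def encode_frequent_tags(df, min_count=5):
    """One-hot encode tags that appear more than min_count times."""
    # Treat any non-list cell (None, NaN, ...) as an empty tag list
    tags = df["tags"].apply(lambda x: x if isinstance(x, list) else [])
    # explode() gives one row per tag; value_counts() ignores the NaN
    # rows produced by empty lists
    counts = tags.explode().value_counts()
    keep = sorted(counts[counts > min_count].index)
    # Build one 0/1 column per frequent tag
    dummies = pd.DataFrame(
        {f"tag_{t}": tags.apply(lambda ts, t=t: int(t in ts)) for t in keep},
        index=df.index,
    )
    return pd.concat([df.drop(columns=["tags"]), dummies], axis=1)
```

Raising `min_count` trades information for fewer columns, so it is worth checking how many columns remain at a few thresholds.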

In [130]:
# OHE categorical variables/ tags here
# tags will have to be done manually
unique_tags = df["tags"].explode().nunique()  # Count unique tags
unique_cities = df["city"].nunique()  # Count unique cities
unique_states = df["state"].nunique()  # Count unique states

total_new_columns = unique_tags + unique_cities + unique_states
print(f"Estimated new columns after OHE: {total_new_columns}")


Estimated new columns after OHE: 1775


In [122]:
tag_counts = df["tags"].explode().value_counts()  # Get tag frequencies
common_tags = tag_counts[tag_counts > 5].index  # Keep tags appearing more than 5 times

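# Filter only list-valued cells; iterating over a string would split it into characters
df["tags"] = df["tags"].apply(lambda x: [tag for tag in x if tag in common_tags] if isinstance(x, list) else x)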


In [124]:
df["tags"] = df["tags"].apply(lambda x: ", ".join(x) if isinstance(x, list) else "Unknown")



In [125]:
#Exploding Lists Before Encoding
df = df.explode("tags")


In [127]:
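# Convert lists to tuples; after explode() most cells are single strings, so wrap those instead of discarding them
df["tags"] = df["tags"].apply(lambda x: tuple(x) if isinstance(x, list) else (x,) if isinstance(x, str) else ("Unknown",))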


In [129]:
df.to_csv("real_estate_listings_with_tags.csv", index=False)
print("CSV file created successfully!")


CSV file created successfully!


### Dealing with Cities

- Sales will vary drastically between cities and states.  Is there a way to keep information about which city it is without OHE?
- Could we label encode or ordinal encode?  Yes, but this may have undesirable effects, giving nominal data ordinal values.
- What we can do is use our training data to encode the mean sale price by city as a feature (a.k.a. Target Encoding)
    - We can do this as long as we ONLY use the training data - we're using the available data to give us a 'starting guess' of the price for each city, without needing to encode city explicitly
- If you replace cities or states with numerical values (like the mean price), make sure the data is split first so that information from the test set does not leak into training. This is a great time to do a train-test split. Compute the values on the training data, then join them to the test data
- Note that you *may* have cities in the test set that are not in the training set. You don't want these to be NA, so maybe you can fill them with the overall mean
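
The target-encoding idea above can be sketched as follows, assuming `train` and `test` are already split and contain `city` and `sold_price` columns. The function name and the `_mean_price` suffix are hypothetical.

```python
import pandas as pd


def target_encode_city(train, test, col="city", target="sold_price"):
    """Add the per-city mean target, computed on the training set only."""
    overall_mean = train[target].mean()
    city_means = train.groupby(col)[target].mean()
    train_out = train.copy()
    test_out = test.copy()
    new_col = f"{col}_mean_price"
    # Map each city to its training mean; missing cities fall back to
    # the overall training mean instead of NaN
    train_out[new_col] = train_out[col].map(city_means).fillna(overall_mean)
    test_out[new_col] = test_out[col].map(city_means).fillna(overall_mean)
    return train_out, test_out
```

Any split method works before calling this, for example `train = df.sample(frac=0.8, random_state=42)` followed by `test = df.drop(train.index)`. The key point is that `city_means` never sees test rows.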

In [None]:
# perform train test split here
# do something with state and city

## Extra Data - STRETCH

> This doesn't need to be part of your Minimum Viable Product (MVP). We recommend you write a functional, basic pipeline first, then circle back and join new data if you have time

> If you do this, try to write your downstream steps in a way it will still work on a dataframe with different features!

- You're not limited to just using the data provided to you. Think/ do some research about other features that might be useful to predict housing prices. 
- Can you import and join this data? Make sure you do any necessary preprocessing and make sure it is joined correctly.
- Example suggestion: could mortgage interest rates in the year of the listing affect the price? 
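
A hedged sketch of joining a per-year table (such as yearly mortgage rates) onto the listings. It assumes the year is taken from `sold_date` and that the external table has one row per year in a `year` column. The function name and the table layout are hypothetical, and the external data itself has to come from a source you trust.

```python
import pandas as pd


def add_yearly_feature(df, yearly, date_col="sold_date", on="year"):
    """Left-join a one-row-per-year lookup table onto listings."""
    out = df.copy()
    # Unparseable dates become NaT, which yields a missing year
    out[on] = pd.to_datetime(out[date_col], errors="coerce").dt.year
    # validate="many_to_one" raises if the lookup table repeats a year,
    # which would otherwise silently duplicate listings
    return out.merge(yearly, on=on, how="left", validate="many_to_one")
```

After the join, check that the row count is unchanged and look at how many rows received NaN for the new feature.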

In [None]:
# import, join and preprocess new data here

## EDA/ Visualization
Remember all of the EDA that you've been learning about? Now is a perfect time for it!
- Look at distributions of numerical variables to see the shape of the data and detect outliers.
    - Consider transforming very skewed variables
- Scatterplots of a numerical variable and the target go a long way to show correlations.
- A heatmap will help detect highly correlated features, and we don't want these.
    - You may have too many features to do this, in which case you can simply compute the most correlated feature-pairs and list them
- Is there any overlap in any of the features? (redundant information, like number of this or that room...)
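
When there are too many features for a readable heatmap, the most correlated pairs can be listed directly. This is a minimal sketch; the function name `top_correlated_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, n=10):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)
```

For the skewness check, `df.select_dtypes(include="number").skew()` gives a quick per-column summary. Columns with a large absolute value are candidates for a log or similar transform.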

In [None]:
# perform EDA here

## Scaling and Finishing Up

Now is a great time to scale the data and save it once it's preprocessed.
- You can save it in your data folder, but you may want to make a new `processed/` subfolder to keep it organized
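
A sketch of scaling with training statistics only and saving the result. The function name, the `train.csv`/`test.csv` file names, and the `out_dir` argument are assumptions for illustration. scikit-learn's `StandardScaler` fitted on the training set performs the same standardization.

```python
import os

import pandas as pd


def scale_and_save(train, test, cols, out_dir):
    """Standardize cols with training statistics and write both sets to CSV."""
    mean = train[cols].mean()
    # Population std (ddof=0); constant columns get std 1 to avoid dividing by 0
    std = train[cols].std(ddof=0).replace(0, 1)
    train_s, test_s = train.copy(), test.copy()
    train_s[cols] = (train[cols] - mean) / std
    test_s[cols] = (test[cols] - mean) / std
    # Create the output folder (e.g. a processed/ subfolder) if needed
    os.makedirs(out_dir, exist_ok=True)
    train_s.to_csv(os.path.join(out_dir, "train.csv"), index=False)
    test_s.to_csv(os.path.join(out_dir, "test.csv"), index=False)
    return train_s, test_s
```

Using the training mean and standard deviation for the test set keeps the test data out of the fitting step, for the same reason as with target encoding.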