FIRST SECTION: CATEGORIZING CITIES AS HOT, MEDIUM, OR COLD

In [416]:
# Key for changing state names to their abbreviations
us_state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA',
    'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT',
    'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM',
    'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
    'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
    'Puerto Rico': 'PR', 'Virgin Islands': 'VI', 'District of Columbia':'DI', 'New Brunswick': 'NB',
    'Guam': 'GU'
}

# FIRST SECTION: CLASSIFYING TEMPERTATURES

In [417]:
import pandas as pd

# Read data
tdf = pd.read_csv("citytemperatures.csv")

# Rename CITY to City
tdf = tdf.rename(columns={"CITY": "City"})

# Convert month columns (2 to 13) to numeric
tdf.iloc[:, 2:14] = tdf.iloc[:, 2:14].apply(pd.to_numeric, errors="coerce")

# Calculate the average temperature across months
tdf["Temperature"] = tdf.iloc[:, 2:14].mean(axis=1)

# Compute the average temperature across multiple instances of the same city if there are multiple inputs (multiple rows for each city)
tdf_avg_temp = tdf.groupby('City', as_index=False)['Temperature'].mean()

# Drop duplicates
tdf = tdf.drop_duplicates(subset=['City'], keep='first')

# Calculate quantiles for splitting data into 3 parts
quantiles = tdf["Temperature"].quantile([0.33, 0.66])

# Function to classify temperature based on quantiles
def classify_temp(temp):
    if temp >= quantiles[0.66]:
        return "Hot"
    elif temp >= quantiles[0.33]:
        return "Medium"
    else:
        return "Cold"

# Apply classification
tdf["Category"] = tdf["Temperature"].apply(classify_temp)

# Select only 'City' and 'Category' columns for final (tdf final = tdff)
tdff = tdf[["City", "Category"]]

# Rename Category to Temperature
tdff = tdff.rename(columns={"Category": "Temperature"})

tdff

Unnamed: 0,City,Temperature
0,"BIRMINGHAM,AL",Hot
1,"HUNTSVILLE,AL",Medium
2,"MOBILE,AL",Hot
3,"MONTGOMERY,AL",Hot
4,"ANCHORAGE,AK",Cold
...,...,...
259,"POHNPEI- CAROLINE IS.,PC",Hot
260,"CHUUK- E. CAROLINE IS.,PC",Hot
261,"YAP- W CAROLINE IS.,PC",Hot
262,"SAN JUAN,PR",Hot


# SECOND SECTION: COMBINING THIS WITH HOUSING DATA

In [418]:
# Read first houying data
hdf = pd.read_csv("housingdata1.csv")

# Drop rows where 'City' or 'State' is NaN
hdf = hdf.dropna(subset=['City', 'State'])

#revised housing dataframe
hdf = hdf[["State", "City", "Bedroom", "Bathroom", "Area", "LotArea", "MarketEstimate", "RentEstimate", "Price"]]

# Read second housing dataset
hdf2 = pd.read_csv("housingdata2.csv")

# Simplify second dataset
hdf2 = hdf2[["state", "city", "bed", "bath", "house_size", "acre_lot", "price"]]

# Rename columns in second dataset to match the first one
hdf2.columns = ['State', 'City', 'Bedroom', 'Bathroom', 'Area', 'LotArea', 'Price']

# Drop rows where 'City' or 'State' is NaN in hd2
hdf2 = hdf2.dropna(subset=['City', 'State'])

# Rename states to their abbreviations
hdf2['State'] = hdf2['State'].map(us_state_abbrev).fillna(hdf2['State'])

# Filter out all non-abbreviated places
hdf2 = hdf2[hdf2['State'].str.match(r'^[A-Z]{2}$', na=False)]

# Combining the two hdfs
hdf = pd.concat([hdf, hdf2], ignore_index=True)

#reformat housing data city to CITY,SI  (SI = STATE INITIAL)
hdf["City"] = hdf["City"] + "," + hdf["State"]

# Ensure key column (city) is of the same type and normalize it so they are all in the same format
hdf.loc[:, "City"] = hdf["City"].str.strip().str.lower()
tdff.loc[:, "City"] = tdff["City"].str.strip().str.lower()

# Merge datasets on 'city'
merged_df = hdf.merge(tdff[["City", "Temperature"]], on="City", how="left")

merged_df

Unnamed: 0,State,City,Bedroom,Bathroom,Area,LotArea,MarketEstimate,RentEstimate,Price,Temperature
0,AL,"saraland,al",4.0,2.0,1614.0,0.38050,240600.0,1599.0,239900.0,
1,AL,"southside,al",3.0,2.0,1474.0,0.67034,186700.0,1381.0,1.0,
2,AL,"robertsdale,al",3.0,2.0,1800.0,3.20000,,,259900.0,
3,AL,"gulf shores,al",2.0,2.0,1250.0,,,,342500.0,
4,AL,"chelsea,al",3.0,3.0,2224.0,0.26000,336200.0,1932.0,335000.0,
...,...,...,...,...,...,...,...,...,...,...
2249097,WA,"richland,wa",4.0,2.0,3600.0,0.33000,,,359900.0,
2249098,WA,"richland,wa",3.0,2.0,1616.0,0.10000,,,350000.0,
2249099,WA,"richland,wa",6.0,3.0,3200.0,0.50000,,,440000.0,
2249100,WA,"richland,wa",2.0,1.0,933.0,0.09000,,,179900.0,


# SECTION 2.5: IMPORTING IN OTHER STATEWISE TEMPERATURE DATA TO REPLACE NAN VALUES

In [419]:
# Read state temperature data
stdf = pd.read_csv("averagestatetemperatures.csv")[["state", "average_temp"]]

# Rename state to State
stdf = stdf.rename(columns={"state": "State"})

# Convert full state names to abbreviations
stdf['State'] = stdf['State'].map(us_state_abbrev)

# Compute the average temperature for each state if there are multiple rows with the same state
stdf_avg_temp = stdf.groupby('State', as_index=False)['average_temp'].mean()

# Drop duplicates
stdf = stdf.drop_duplicates(subset=['State'], keep='first')

# Calculate quantiles for splitting data into 3 parts
quantiles = stdf["average_temp"].quantile([0.33, 0.66])

# Function to classify temperature based on quantiles
def classify_temp(temp):
    if temp >= quantiles[0.66]:
        return "Hot"
    elif temp >= quantiles[0.33]:
        return "Medium"
    else:
        return "Cold"

# Apply classification
stdf["Temperature"] = stdf["average_temp"].apply(classify_temp)

# Remove avg temp
stdf = stdf.drop(columns=['average_temp'])

# Manually input values for states which were not in the state dataset
stdf.loc[len(stdf)] = ['ak', 'Cold']
stdf.loc[len(stdf)] = ['pr', 'Hot']
stdf.loc[len(stdf)] = ['di', 'Medium']
stdf.loc[len(stdf)] = ['vi', 'Hot']
stdf.loc[len(stdf)] = ['nb', 'Cold']
stdf.loc[len(stdf)] = ['hi', 'Hot']
stdf.loc[len(stdf)] = ['gu', 'Hot']

# Ensure data in State column is all in the same format
stdf.loc[:, "State"] = stdf["State"].str.strip().str.lower()
merged_df.loc[:, "State"] = merged_df["State"].str.strip().str.lower()

# Merge state temps into merged df, keeping BOTH temperature columns rather than combining state temps into the merged_df temps. This is because we only want to update
# merged_df temps if there was a NaN entry there 
merged_df = merged_df.merge(stdf, on='State', how='left', suffixes=('_old', '_new'))

# Fill NaN values in the original Temperature column with values from the second dataset
merged_df['Temperature_old'] = merged_df['Temperature_old'].fillna(merged_df['Temperature_new'])

merged_df = merged_df.drop(columns=['Temperature_new'])

merged_df = merged_df.rename(columns={"Temperature_old": "Temperature"})

merged_df

Unnamed: 0,State,City,Bedroom,Bathroom,Area,LotArea,MarketEstimate,RentEstimate,Price,Temperature
0,al,"saraland,al",4.0,2.0,1614.0,0.38050,240600.0,1599.0,239900.0,Hot
1,al,"southside,al",3.0,2.0,1474.0,0.67034,186700.0,1381.0,1.0,Hot
2,al,"robertsdale,al",3.0,2.0,1800.0,3.20000,,,259900.0,Hot
3,al,"gulf shores,al",2.0,2.0,1250.0,,,,342500.0,Hot
4,al,"chelsea,al",3.0,3.0,2224.0,0.26000,336200.0,1932.0,335000.0,Hot
...,...,...,...,...,...,...,...,...,...,...
2249097,wa,"richland,wa",4.0,2.0,3600.0,0.33000,,,359900.0,Cold
2249098,wa,"richland,wa",3.0,2.0,1616.0,0.10000,,,350000.0,Cold
2249099,wa,"richland,wa",6.0,3.0,3200.0,0.50000,,,440000.0,Cold
2249100,wa,"richland,wa",2.0,1.0,933.0,0.09000,,,179900.0,Cold


# THIRD SECTION: DOING THE SAME THING WITH QOL RATING

In [420]:
# Read qol data
qoldf = pd.read_csv("qolcitydata.csv", encoding="ISO-8859-1")

# Exract only the columns we want
qoldf = qoldf[["LCITY", "LSTATE", "2016 Crime Rate", "Unemployment", "AQI%Good", "WaterQualityVPV", "%CvgCityPark", "Cost of Living", "2022 Median Income", "AVG C2I", "Diversity Rank (Race)", "Diversity Rank (Gender)"]]

# Correct format of qoldf cities to match that of our merged_df
qoldf.loc[:, "LCITY"] = qoldf["LCITY"].str.strip().str.lower()
qoldf.loc[:, "LSTATE"] = qoldf["LSTATE"].str.strip().str.lower()
qoldf["LCITY"] = qoldf["LCITY"] + "," + qoldf["LSTATE"]

# Drop rows where 'City' or 'State' is NaN
qoldf = qoldf.dropna(subset=['LCITY', 'LSTATE'])

# Drop duplicate entries of a city
qoldf = qoldf.drop_duplicates(subset="LCITY", keep="first")

# Rename LCITY to City for merging
qoldf = qoldf.rename(columns={"LCITY": "City"})

merged_df = merged_df.merge(qoldf[["City", "2016 Crime Rate", "Unemployment", "AQI%Good", "WaterQualityVPV", "%CvgCityPark", "Cost of Living", "2022 Median Income", "AVG C2I", "Diversity Rank (Race)", "Diversity Rank (Gender)"]], on="City", how="left")

merged_df

  qoldf = pd.read_csv("qolcitydata.csv", encoding="ISO-8859-1")


Unnamed: 0,State,City,Bedroom,Bathroom,Area,LotArea,MarketEstimate,RentEstimate,Price,Temperature,2016 Crime Rate,Unemployment,AQI%Good,WaterQualityVPV,%CvgCityPark,Cost of Living,2022 Median Income,AVG C2I,Diversity Rank (Race),Diversity Rank (Gender)
0,al,"saraland,al",4.0,2.0,1614.0,0.38050,240600.0,1599.0,239900.0,Hot,47/1000,3.35%,80.94%,1.0,-1,"$71,947.38","$62,409.46",115.28%,26459.0,63210.0
1,al,"southside,al",3.0,2.0,1474.0,0.67034,186700.0,1381.0,1.0,Hot,43/1000,3.12%,80.94%,0.0,-1,"$67,812.73","$58,943.92",115.05%,69642.0,79134.0
2,al,"robertsdale,al",3.0,2.0,1800.0,3.20000,,,259900.0,Hot,18/1000,2.41%,80.94%,1.0,-1,"$79,155.41","$77,884.76",101.63%,29479.0,36363.0
3,al,"gulf shores,al",2.0,2.0,1250.0,,,,342500.0,Hot,18/1000,2.41%,80.94%,1.0,-1,"$79,155.41","$77,884.76",101.63%,56013.0,31948.0
4,al,"chelsea,al",3.0,3.0,2224.0,0.26000,336200.0,1932.0,335000.0,Hot,16/1000,1.87%,80.94%,-1.0,-1,"$85,691.03","$98,419.23",87.07%,44179.0,41526.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2249097,wa,"richland,wa",4.0,2.0,3600.0,0.33000,,,359900.0,Cold,23/1000,5.31%,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0
2249098,wa,"richland,wa",3.0,2.0,1616.0,0.10000,,,350000.0,Cold,23/1000,5.31%,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0
2249099,wa,"richland,wa",6.0,3.0,3200.0,0.50000,,,440000.0,Cold,23/1000,5.31%,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0
2249100,wa,"richland,wa",2.0,1.0,933.0,0.09000,,,179900.0,Cold,23/1000,5.31%,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0


# FOURTH SECTION: FINALLY, ADD IN MEAN INCOME FOR EACH CITY

In [421]:
# Read data
idf = pd.read_csv("meancityincome.csv", encoding="ISO-8859-1")[["State_ab", "City", "Mean", "Median"]]

# Rename columns
idf = idf.rename(columns={"State_ab": "State"})
idf = idf.rename(columns={"Mean": "MeanIncome"})
idf = idf.rename(columns={"Median": "MedianIncome"})

# Correct format of idf cities to match that of our merged_df
idf.loc[:, "City"] = idf["City"].str.strip().str.lower()
idf.loc[:, "State"] = idf["State"].str.strip().str.lower()
idf["City"] = idf["City"] + "," + idf["State"]

# Average out the mean and median entries for cases where one city has multiple entries
idf = idf.groupby('City')[['MeanIncome', 'MedianIncome']].mean().round().reset_index()
idf = idf.drop_duplicates(subset='City')

# Merge with merged_df
merged_df = merged_df.merge(idf[["City", "MeanIncome", "MedianIncome"]], on="City", how="left")

nan_count = merged_df['Cost of Living'].isna().sum()
print(f"{nan_count}")

merged_df

306795


Unnamed: 0,State,City,Bedroom,Bathroom,Area,LotArea,MarketEstimate,RentEstimate,Price,Temperature,...,AQI%Good,WaterQualityVPV,%CvgCityPark,Cost of Living,2022 Median Income,AVG C2I,Diversity Rank (Race),Diversity Rank (Gender),MeanIncome,MedianIncome
0,al,"saraland,al",4.0,2.0,1614.0,0.38050,240600.0,1599.0,239900.0,Hot,...,80.94%,1.0,-1,"$71,947.38","$62,409.46",115.28%,26459.0,63210.0,43803.0,300000.0
1,al,"southside,al",3.0,2.0,1474.0,0.67034,186700.0,1381.0,1.0,Hot,...,80.94%,0.0,-1,"$67,812.73","$58,943.92",115.05%,69642.0,79134.0,,
2,al,"robertsdale,al",3.0,2.0,1800.0,3.20000,,,259900.0,Hot,...,80.94%,1.0,-1,"$79,155.41","$77,884.76",101.63%,29479.0,36363.0,67084.0,172640.0
3,al,"gulf shores,al",2.0,2.0,1250.0,,,,342500.0,Hot,...,80.94%,1.0,-1,"$79,155.41","$77,884.76",101.63%,56013.0,31948.0,65583.0,300000.0
4,al,"chelsea,al",3.0,3.0,2224.0,0.26000,336200.0,1932.0,335000.0,Hot,...,80.94%,-1.0,-1,"$85,691.03","$98,419.23",87.07%,44179.0,41526.0,78399.0,71839.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2249097,wa,"richland,wa",4.0,2.0,3600.0,0.33000,,,359900.0,Cold,...,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0
2249098,wa,"richland,wa",3.0,2.0,1616.0,0.10000,,,350000.0,Cold,...,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0
2249099,wa,"richland,wa",6.0,3.0,3200.0,0.50000,,,440000.0,Cold,...,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0
2249100,wa,"richland,wa",2.0,1.0,933.0,0.09000,,,179900.0,Cold,...,92.89%,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0


# SEC 4.5: ADDING POPULATIONS


In [422]:
pdf = pd.read_csv('uscitypopulations.csv')[['CITY', 'STATE', '2022_POPULATION']]

# Rename columns
pdf.columns = ['City', 'State', 'Population']

# Convert full state names to abbreviations
pdf['State'] = pdf['State'].map(us_state_abbrev)

# Correct formatting of cities and towns
pdf['City'] = pdf['City'].str.replace(r'\b( city| town)\b', '', case=False, regex=True).str.strip()
pdf.loc[:, "City"] = pdf["City"].str.strip().str.lower()
pdf.loc[:, "State"] = pdf["State"].str.strip().str.lower()
pdf["City"] = pdf["City"] + "," + pdf["State"]

pdf

# Merge with merged_df
merged_df = merged_df.merge(pdf[["City", "Population"]], on="City", how="left")

merged_df


Unnamed: 0,State,City,Bedroom,Bathroom,Area,LotArea,MarketEstimate,RentEstimate,Price,Temperature,...,WaterQualityVPV,%CvgCityPark,Cost of Living,2022 Median Income,AVG C2I,Diversity Rank (Race),Diversity Rank (Gender),MeanIncome,MedianIncome,Population
0,al,"saraland,al",4.0,2.0,1614.0,0.38050,240600.0,1599.0,239900.0,Hot,...,1.0,-1,"$71,947.38","$62,409.46",115.28%,26459.0,63210.0,43803.0,300000.0,16358.0
1,al,"southside,al",3.0,2.0,1474.0,0.67034,186700.0,1381.0,1.0,Hot,...,0.0,-1,"$67,812.73","$58,943.92",115.05%,69642.0,79134.0,,,9554.0
2,al,"robertsdale,al",3.0,2.0,1800.0,3.20000,,,259900.0,Hot,...,1.0,-1,"$79,155.41","$77,884.76",101.63%,29479.0,36363.0,67084.0,172640.0,7189.0
3,al,"gulf shores,al",2.0,2.0,1250.0,,,,342500.0,Hot,...,1.0,-1,"$79,155.41","$77,884.76",101.63%,56013.0,31948.0,65583.0,300000.0,16193.0
4,al,"chelsea,al",3.0,3.0,2224.0,0.26000,336200.0,1932.0,335000.0,Hot,...,-1.0,-1,"$85,691.03","$98,419.23",87.07%,44179.0,41526.0,78399.0,71839.0,16193.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2249310,wa,"richland,wa",4.0,2.0,3600.0,0.33000,,,359900.0,Cold,...,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0,62821.0
2249311,wa,"richland,wa",3.0,2.0,1616.0,0.10000,,,350000.0,Cold,...,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0,62821.0
2249312,wa,"richland,wa",6.0,3.0,3200.0,0.50000,,,440000.0,Cold,...,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0,62821.0
2249313,wa,"richland,wa",2.0,1.0,933.0,0.09000,,,179900.0,Cold,...,0.0,-1,"$72,571.13","$83,393.68",87.02%,35964.0,87680.0,115261.0,235998.0,62821.0


## Drop Nan Values as well as other unnecessary columns

In [423]:
# Simplify data by dropping columns we're not interested in
merged_df = merged_df.drop(columns=['MarketEstimate', 'RentEstimate', '%CvgCityPark', 'Diversity Rank (Race)', 'Diversity Rank (Gender)', 'AVG C2I', 'MeanIncome', 'MedianIncome'])

# Drop NaN Vals
merged_df = merged_df.dropna()

merged_df

Unnamed: 0,State,City,Bedroom,Bathroom,Area,LotArea,Price,Temperature,2016 Crime Rate,Unemployment,AQI%Good,WaterQualityVPV,Cost of Living,2022 Median Income,Population
0,al,"saraland,al",4.0,2.0,1614.0,0.38050,239900.0,Hot,47/1000,3.35%,80.94%,1.0,"$71,947.38","$62,409.46",16358.0
1,al,"southside,al",3.0,2.0,1474.0,0.67034,1.0,Hot,43/1000,3.12%,80.94%,0.0,"$67,812.73","$58,943.92",9554.0
2,al,"robertsdale,al",3.0,2.0,1800.0,3.20000,259900.0,Hot,18/1000,2.41%,80.94%,1.0,"$79,155.41","$77,884.76",7189.0
4,al,"chelsea,al",3.0,3.0,2224.0,0.26000,335000.0,Hot,16/1000,1.87%,80.94%,-1.0,"$85,691.03","$98,419.23",16193.0
6,al,"montgomery,al",3.0,2.0,1564.0,8712.00000,151000.0,Hot,47/1000,3.17%,80.94%,1.0,"$74,899.78","$64,886.16",196986.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2249310,wa,"richland,wa",4.0,2.0,3600.0,0.33000,359900.0,Cold,23/1000,5.31%,92.89%,0.0,"$72,571.13","$83,393.68",62821.0
2249311,wa,"richland,wa",3.0,2.0,1616.0,0.10000,350000.0,Cold,23/1000,5.31%,92.89%,0.0,"$72,571.13","$83,393.68",62821.0
2249312,wa,"richland,wa",6.0,3.0,3200.0,0.50000,440000.0,Cold,23/1000,5.31%,92.89%,0.0,"$72,571.13","$83,393.68",62821.0
2249313,wa,"richland,wa",2.0,1.0,933.0,0.09000,179900.0,Cold,23/1000,5.31%,92.89%,0.0,"$72,571.13","$83,393.68",62821.0


# SECTION 5: UPLOADING TO SQL

In [424]:
from sqlalchemy import create_engine
from dotenv import load_dotenv
from supabase import create_client, Client
import os

# Load environment variables
load_dotenv()

# Retrieve database credentials from .env file
DB_URL = os.getenv("DATABASE_URL")
SUPABASE_KEY = os.getenv("SUPABASE_KEY")
SUPABASE_URL = os.getenv("SUPABASE_URL")
DB_PASS = os.getenv("DATABASE_PASS")

# Initialize Supabase client
supabase: Client = create_client(SUPABASE_URL, SUPABASE_KEY)

# Create SQLAlchemy engine
engine = create_engine(DB_URL, pool_size=5, max_overflow=10)

try:
    with engine.connect() as conn:
        print("Connected to the database successfully!")
except Exception as e:
    print(f"Error: {e}")


try:
    # merged_df.to_sql('Housing Data', engine, if_exists='replace', index=False)
    print("Data uploaded successfully!")
except Exception as e:
    print(f"Error uploading data: {e}")

# Upload DataFrame to SQL table
# merged_df.to_sql("Housing Data", engine, if_exists="replace", index=False)

print("DataFrame successfully uploaded to the SQL database.")

Connected to the database successfully!
Data uploaded successfully!
DataFrame successfully uploaded to the SQL database.


# SECTION 6: KNN

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Select features for clustering
feature_cols = ['Bedroom', 'Bathroom', 'Area', 'LotArea', 'Price', '2022 Median Income', 'Temperature', 'Population']
quality_cols = ['AQI%Good', 'WaterQualityVPV', 'Unemployment', '2016 Crime Rate', 'Cost of Living']
# Note: Curious about whether or not we should have Cost of Living be an inputted feature variable or if it should just be something that's generally minimized

# Clean currency columns
for col in ['Price', '2022 Median Income', 'Cost of Living']:
    merged_df[col] = merged_df[col].replace('[\$,]', '', regex=True).astype(float)

# Clean crime rate
merged_df['2016 Crime Rate'] = (
    merged_df['2016 Crime Rate']
    .astype(str)
    .str.extract(r'(\d+)/(\d+)')
    .astype(float)
    .apply(lambda row: row[0] / row[1] if pd.notna(row[0]) and pd.notna(row[1]) else np.nan, axis=1)
)

# Clean quality columns
for col in quality_cols:
    merged_df[col] = (
        merged_df[col]
        .astype(str)
        .str.replace('%', '', regex=False)
        .str.replace(',', '', regex=False)
        .replace({'N/A': np.nan, 'unknown': np.nan, 'Missing': np.nan})
    )
    merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')

# Map temperature to numeric
temp_mapping = {'Cold': 0, 'Medium': 1, 'Hot': 2}
merged_df['Temperature'] = merged_df['Temperature'].map(temp_mapping)

# Save dataframe as it is
merged_df2 = merged_df.copy()

# Preprocess: scale the features
scaler = StandardScaler()
X = scaler.fit_transform(merged_df[feature_cols])
# X = merged_df[feature_cols].values

# Get unique cities and map them
cities = merged_df['City'].unique()
city_to_index = {city: idx for idx, city in enumerate(cities)}
index_to_city = {idx: city for city, idx in city_to_index.items()}

# Assign city labels
y = merged_df['City'].map(city_to_index)

# Reduce dimensions for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

def recommend_city(user_input, visualize=True):
    # user_input: includes Bedroom, Bathroom, Area, LotArea, Price, Temperature, 2022 Median Income, Population
    temp_mapping = {'Cold': 0, 'Medium': 1, 'Hot': 2}
    
    user_features = np.array([
        user_input['Bedroom'],
        user_input['Bathroom'],
        user_input['Area'],
        user_input['LotArea'],
        user_input['Price'],
        user_input['2022 Median Income'],
        temp_mapping[user_input['Temperature']],  # map temp string to number
        user_input['Population']
    ]).reshape(1, -1)

    user_features_scaled = scaler.transform(user_features)

    # Calculate Euclidean distances
    distances = np.linalg.norm(X - user_features_scaled, axis=1)

    # Find k nearest neighbors
    k = 5
    top_k_indices = np.argsort(distances)[:k]

    # Look up corresponding cities
    neighbor_cities = y.iloc[top_k_indices].values

    # Tally up most common city
    unique, counts = np.unique(neighbor_cities, return_counts=True)
    candidate_cities = [index_to_city[idx] for idx in unique]

    # Filter original df for these candidates
    candidates_df = merged_df[merged_df['City'].isin(candidate_cities)].copy()

    # Score candidates based on quality metrics (Increasing or decreasing distance based on QoL features)
    candidates_df['QualityScore'] = (
        candidates_df['AQI%Good'] + candidates_df['WaterQualityVPV']
        - candidates_df['Unemployment'] * 2 - candidates_df['2016 Crime Rate'] * 2
        - candidates_df['Cost of Living'] * 1.5
    )

    # Group by city and take the best average score
    city_scores = candidates_df.groupby('City')['QualityScore'].mean()

    # Best city
    best_city = city_scores.idxmax()

    visualize = False
    if visualize:
        # Visualize
        user_2d = pca.transform(user_features_scaled)

        plt.figure(figsize=(10, 7))
        sample_size = 5000
        if len(merged_df) > sample_size:
            sampled_indices = np.random.choice(len(merged_df), size=sample_size, replace=False)
            sampled_df = merged_df.iloc[sampled_indices]
            X_2d_sampled = X_2d[sampled_indices]
        else:
            sampled_df = merged_df
            X_2d_sampled = X_2d

        for city in sampled_df['City'].unique():
            mask = sampled_df['City'] == city
            plt.scatter(X_2d_sampled[mask, 0], X_2d_sampled[mask, 1], label=city, alpha=0.5)

        plt.scatter(user_2d[0, 0], user_2d[0, 1], color='black', marker='X', s=200, label='User Input')

        for idx in top_k_indices:
            plt.plot([user_2d[0, 0], X_2d[idx, 0]], [user_2d[0, 1], X_2d[idx, 1]], 'k--', alpha=0.7)

        plt.title(f'Recommended City: {best_city}')
        plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.xlabel('PCA Component 1')
        plt.ylabel('PCA Component 2')
        plt.grid(True)
        plt.tight_layout()
        plt.show()

    return best_city

# Example user input
user_input = {
    'Bedroom': 3,
    'Bathroom': 2,
    'Area': 1500,
    'LotArea': 5000,
    'Price': 1000000,
    '2022 Median Income': 90000,
    'Temperature': 'Cold',   # Now passed as text
    'Population': 500000     # New: user estimates size of city they want
}

best_city = recommend_city(user_input, visualize=True)
print(f"Recommended city: {best_city}")s

  merged_df[col] = merged_df[col].replace('[\$,]', '', regex=True).astype(float)


Recommended city: fargo,nd




# SECTION 7: TESTING ACCURACY

Note: This first trial tests the accuracy of the features for each row in the testing data returning the specific city that is actually assigned to them. Since our model does not take in quality of life features as inputted Features, but instead tries to pick a city that generally minimizes or maximizes them (depending on the specific QoL feature) by default, this will return a relatively low accuracy since certain cities will have lower quality of life ratings and so will not be returned by the Model.

In [426]:
from sklearn.model_selection import train_test_split

# Split data (assuming Temperature is already part of X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Important: fix randomness
np.random.seed(42)

# Sample 500 test points
sample_size = 500
sample_indices = np.random.choice(len(X_test), size=sample_size, replace=False)

X_test_sampled = X_test[sample_indices]        # numpy array indexing
y_test_sampled = y_test.iloc[sample_indices]   # pandas Series positional indexing

# Recommendation function — no temperature penalty needed
def recommend_city_from_train(user_features_scaled, X_train, y_train, k=5):
    distances = np.linalg.norm(X_train - user_features_scaled, axis=1)
    
    top_k_indices = np.argsort(distances)[:k]
    neighbor_cities = y_train.iloc[top_k_indices].values
    unique, counts = np.unique(neighbor_cities, return_counts=True)
    best_city_idx = unique[np.argmax(counts)]
    return best_city_idx

# Evaluate accuracy on the sampled test set
correct = 0
total = len(X_test_sampled)

for i in range(total):
    user_features_scaled = X_test_sampled[i].reshape(1, -1)  # numpy indexing
    predicted_idx = recommend_city_from_train(user_features_scaled, X_train, y_train)
    actual_idx = y_test_sampled.iloc[i]  # pandas indexing

    if predicted_idx == actual_idx:
        correct += 1

accuracy = correct / total
print(f"Sampled model accuracy (on {sample_size} points): {accuracy:.2f}")

Sampled model accuracy (on 500 points): 0.57


# SECTION 8: TESTING ACCURACY WITHOUT SIMPLY MAXIMIZING QUALITY OF LIFE FEATURES

Now, we tinker with constructing our model to recieve QoL features as input, so that we can see how good it is at generally predicting data. If this model can return a high accuracy, we know our KNN Algorithm is working properly to at least predict the cities accurately. However, for the final model, we will change it back to maximizing or minimizing certain QoL values because we are not just trying to predict a city but are trying to give the user the best option.

In [427]:
from sklearn.model_selection import train_test_split

# Reset dataframe
merged_df = merged_df2.copy()

# Step 1: Prepare feature set that includes quality of life features + temperature + population
full_feature_cols = ['Bedroom', 'Bathroom', 'Area', 'LotArea', 'Price', '2022 Median Income',
                     'AQI%Good', 'WaterQualityVPV', 'Unemployment', '2016 Crime Rate', 'Cost of Living', 
                     'Temperature', 'Population'] 

# Preprocess: scale the full features
scaler_full = StandardScaler()
X_full = scaler_full.fit_transform(merged_df[full_feature_cols])
# X_full = merged_df[full_feature_cols].values

# Labels stay the same
y_full = merged_df['City'].map(city_to_index)

# Step 2: Train/test split
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, test_size=0.2, random_state=42)

# Important: fix randomness
np.random.seed(42)

# Optional: sample 500 points to keep speed reasonable
sample_size = 500
sample_indices = np.random.choice(len(X_test_full), size=sample_size, replace=False)
X_test_sampled_full = X_test_full[sample_indices]
y_test_sampled_full = y_test_full.iloc[sample_indices]

# Step 3: Define a pure recommend function
def pure_recommend_city(user_features_scaled, X_train, y_train, k=5):
    distances = np.linalg.norm(X_train - user_features_scaled, axis=1)

    top_k_indices = np.argsort(distances)[:k]
    neighbor_cities = y_train.iloc[top_k_indices].values
    unique, counts = np.unique(neighbor_cities, return_counts=True)
    best_city_idx = unique[np.argmax(counts)]
    return best_city_idx

# Step 4: Evaluate pure matching accuracy
correct = 0
total = len(X_test_sampled_full)

for i in range(total):
    user_features_scaled = X_test_sampled_full[i].reshape(1, -1)
    predicted_idx = pure_recommend_city(user_features_scaled, X_train_full, y_train_full)
    actual_idx = y_test_sampled_full.iloc[i]

    if predicted_idx == actual_idx:
        correct += 1

pure_accuracy = correct / total
print(f"Pure Matching Model Accuracy: {pure_accuracy:.2f}")

Pure Matching Model Accuracy: 0.80


## Now, we try to tune the model for higher accuracy by increasing the weights of important, city-defining features, dropping noisy features, increasing the value of k, and other tweaks

Once we find the modifications that maximize the model's accuracy without QoL optimization, we can re-implement these tweaks into the regular model which DOES maximize QoL Values, as we know that these results will be more accurate to the user's inputs while still maximizing better quality of life.

In [428]:
from sklearn.model_selection import train_test_split

# Reset dataframe
merged_df = merged_df2.copy()

# Step 1: Prepare feature set that includes quality of life features + temperature + population
full_feature_cols = ['Bedroom', 'Bathroom', 'Area', 'LotArea', 'Price', '2022 Median Income',
                     'AQI%Good', 'WaterQualityVPV', 'Unemployment', '2016 Crime Rate', 'Cost of Living', 
                     'Temperature', 'Population']

# merged_df["2022 Median Income"] *= 1
# merged_df["Cost of Living"] *= 2.5
# merged_df["Population"] *= 2
# merged_df["Temperature"] *= 2.5 
#merged_df['Bedroom'] *= 1.5
#merged_df['Bathroom'] *= 1.5
#merged_df['Area'] *= 5
# merged_df['LotArea'] *= 1.5

# Preprocess: scale the full features
scaler_full = StandardScaler()
X_full = scaler_full.fit_transform(merged_df[full_feature_cols])
# X_full = merged_df[full_feature_cols].values

# Weigh features to prioritize more important, impactful ones
# feature_weights = np.array([1, 1, 1, 1, 1, 2.5, 1, 1, 1, 1, 2, 3, 2])
feature_weights = np.array([1, 1, 1, 1.5, 1, 4, 1, 1, 1, 1, 2.5, 3, 3])


# Labels stay the same
y_full = merged_df['City'].map(city_to_index)

# Step 2: Train/test split
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, test_size=0.2, random_state=42)

# Important: fix randomness
np.random.seed(42)

# Optional: sample 500 points to keep speed reasonable
sample_size = 500
sample_indices = np.random.choice(len(X_test_full), size=sample_size, replace=False)
X_test_sampled_full = X_test_full[sample_indices]
y_test_sampled_full = y_test_full.iloc[sample_indices]

# Step 3: Define a pure recommend function (NO temperature penalty anymore)
def pure_recommend_city(user_features_scaled, X_train, y_train, k=5):
    weighted_diff = (X_train - user_features_scaled) * feature_weights
    distances = np.linalg.norm(weighted_diff, axis=1)

    top_k_indices = np.argsort(distances)[:k]
    neighbor_cities = y_train.iloc[top_k_indices].values
    unique, counts = np.unique(neighbor_cities, return_counts=True)
    best_city_idx = unique[np.argmax(counts)]
    return best_city_idx

# Step 4: Evaluate pure matching accuracy
correct = 0
total = len(X_test_sampled_full)

for i in range(total):
    user_features_scaled = X_test_sampled_full[i].reshape(1, -1)
    predicted_idx = pure_recommend_city(user_features_scaled, X_train_full, y_train_full)
    actual_idx = y_test_sampled_full.iloc[i]

    if predicted_idx == actual_idx:
        correct += 1

pure_accuracy = correct / total
print(f"Pure Matching Tuned Model Accuracy: {pure_accuracy:.2f}")

Pure Matching Tuned Model Accuracy: 0.85


## Now, we reconstruct the KNN with these modifications to get our final, most accurate KNN which prioritizes Quality of Life values

In [430]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Reset dataframe
merged_df = merged_df2.copy()

# Select features for clustering
feature_cols = ['Bedroom', 'Bathroom', 'Area', 'LotArea', 'Price', '2022 Median Income', 'Temperature', 'Population']
quality_cols = ['AQI%Good', 'WaterQualityVPV', 'Unemployment', '2016 Crime Rate', 'Cost of Living']
# Note: Curious about whether or not we should have Cost of Living be an inputted feature variable or if it should just be something that's generally minimized

# Preprocess: scale the features
scaler = StandardScaler()
X = scaler.fit_transform(merged_df[feature_cols])
# X = merged_df[feature_cols].values

# Weigh features to prioritize more important, impactful ones
feature_weights = np.array([1, 1, 1, 1.5, 1, 4, 3, 3])

# Get unique cities and map them
cities = merged_df['City'].unique()
city_to_index = {city: idx for idx, city in enumerate(cities)}
index_to_city = {idx: city for city, idx in city_to_index.items()}

# Assign city labels
y = merged_df['City'].map(city_to_index)

# Reduce dimensions for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

def recommend_city(user_input, visualize=True):
    # user_input: includes Bedroom, Bathroom, Area, LotArea, Price, Temperature, 2022 Median Income, Population
    temp_mapping = {'Cold': 0, 'Medium': 1, 'Hot': 2}
    
    user_features = np.array([
        user_input['Bedroom'],
        user_input['Bathroom'],
        user_input['Area'],
        user_input['LotArea'],
        user_input['Price'],
        user_input['2022 Median Income'],
        temp_mapping[user_input['Temperature']],  # map temp string to number
        user_input['Population']
    ]).reshape(1, -1)

    user_features_scaled = scaler.transform(user_features)

    # Calculate Euclidean distances
    weighted_diff = (X_train - user_features_scaled) * feature_weights
    distances = np.linalg.norm(weighted_diff, axis=1)

    # Find k nearest neighbors
    k = 5
    top_k_indices = np.argsort(distances)[:k]

    # Look up corresponding cities
    neighbor_cities = y.iloc[top_k_indices].values

    # Tally up most common city
    unique, counts = np.unique(neighbor_cities, return_counts=True)
    candidate_cities = [index_to_city[idx] for idx in unique]

    # Filter original df for these candidates
    candidates_df = merged_df[merged_df['City'].isin(candidate_cities)].copy()

    # Score candidates based on quality metrics (Increasing or decreasing distance based on QoL features)
    candidates_df['QualityScore'] = (
        candidates_df['AQI%Good'] + candidates_df['WaterQualityVPV']
        - candidates_df['Unemployment'] * 2
        - candidates_df['2016 Crime Rate'] * 2
        - candidates_df['Cost of Living'] * 1.5
    )

    # Group by city and take the best average score
    city_scores = candidates_df.groupby('City')['QualityScore'].mean()

    # Best city
    best_city = city_scores.idxmax()

    visualize = False
    if visualize:
        # Visualize
        user_2d = pca.transform(user_features_scaled)

        plt.figure(figsize=(10, 7))
        sample_size = 5000
        if len(merged_df) > sample_size:
            sampled_indices = np.random.choice(len(merged_df), size=sample_size, replace=False)
            sampled_df = merged_df.iloc[sampled_indices]
            X_2d_sampled = X_2d[sampled_indices]
        else:
            sampled_df = merged_df
            X_2d_sampled = X_2d

        for city in sampled_df['City'].unique():
            mask = sampled_df['City'] == city
            plt.scatter(X_2d_sampled[mask, 0], X_2d_sampled[mask, 1], label=city, alpha=0.5)

        plt.scatter(user_2d[0, 0], user_2d[0, 1], color='black', marker='X', s=200, label='User Input')

        for idx in top_k_indices:
            plt.plot([user_2d[0, 0], X_2d[idx, 0]], [user_2d[0, 1], X_2d[idx, 1]], 'k--', alpha=0.7)

        plt.title(f'Recommended City: {best_city}')
        plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
        plt.xlabel('PCA Component 1')
        plt.ylabel('PCA Component 2')
        plt.grid(True)
        plt.tight_layout()
        plt.show()

    return best_city

# Example user input
user_input = {
    'Bedroom': 3,
    'Bathroom': 2,
    'Area': 2500,
    'LotArea': 5000,
    'Price': 1000000,
    '2022 Median Income': 160000,
    'Temperature': 'Cold',   # Now passed as text
    'Population': 500000     # New: user estimates size of city they want
}

best_city = recommend_city(user_input, visualize=True)
print(f"Recommended city: {best_city}")

Recommended city: natchez,ms




# SECTION 9: CREATING A KNN FOR EVERY POSSIBLE COMBINATION OF USER INPUTTED FEATURES

## INFO FOR MIDWAY REPORT:

- Essentially what we're gonna do here is, after deciding which features we want to make Mandatory for the user to input, we will create a model for each different combination of user inputted features, so that the result that our entire machine/program gives them is accurate to what they inputted. For example, if they only specified temperature, price, and square footage, we should give them their best city based only on those fields, rather than simply choosing a default value for their unfilled fields and running that on our first model. This would greatly limit the possible outcomes of the model to be cities that match the default fields we would select for when a user leaves something blank. So we're not gonna do this. We probably want less than 30 or 40 models, cuz otherwise it gets redundant, so maybe we'll only leave 5 or 6 features optional.

- This is imporant because this specific implementation is how we are overcoming the problem of a MODEL RUNNING WHEN INPUT FEATURES ARE MISSING, which is a large part of our project and is one of the main questions we sought to answer. We also were interested in removing outliers, which we could propose involves modifying the data to look at things like the average house size/num bedrooms found in a city, etc, because there may be outliers in each city that are massively large mansions even though the city is generally poor. So the solution would be essentially changing the data to represent averages and medians across cities, specifically when it involves the house-features (since the city features are already the same for all house entries which are in the same city, since we filled out the city-related columns just based off of the city that the house was in). This is something we can suggest we will look more into going forward.

- Finally, we should also talk about what else we do to expand past KNNs, since we are almost entirely done with the KNNs. We have already proposed that we look into other types of models that could essentially create the same tool we're trying to build based off of our data, and then investigate how to deal with missing features as well as outliers in both of then. I think that we should say we're interested in doing a Linear Regression model, as well as trying random forest. I think Linear Regression would be stronger to have on our Midway report, even if we're not gonna do it. I also think we should say we're interested in trying a neural network.

- OOH - BOOM. I THINK I GOT IT. WE SHOULD SAY THAT, ONCE WE MAKE A FEW MODELS WHICH DO THE SAME THING, USING DIFFERENT ML TACTICS, WE SHOULD PLAN TO COMPARE THEM BY SEEING WHAT DIFFERENT CITIES THEY COME UP WITH WHEN GIVEN THE SAME INPUT AFTER BEING TRAINED ON THE SAME DATA. THIS WOULD BE REALLY INTERESTING, BECAUSE THEN WE COULD CREATE A SIMPLE TESTING ALGORITHM WHICH FINDS OUT WHICH MODEL GAVE A CITY MORE ACCURATE TO THE USER'S PREFERENCES. THEN WE CAN THEORIZE WHY ONE IS BETTER THAN THE OTHER, AND BOOM THAT'S A GOOD PROJECT.

- Also, in the midway report, make sure to specify how we've divided our data into input features from the user and just general QoL features that the model either tries to maximize or minimize in the cities it outputs, regardless of the user inputted data. This feature makes our approach a bit more unique compared to just any other KNN or ML model which provides a predicted output based on features. Otherwisem, just look through all the comments and headers in this document and try to summarize what has been done. This document pretty much contains everything that has been done for the project LMAO.

- For the literature section, we need to do more work (Adhi was also right during that meeting, we essentially do have to BS it since most of my info has come from chatgpt), so we need to find some resources that we can pretend we got inspiration and insight from in terms of coding our project. I used freecodecamp a bit to learn about this stuff, so i can add that in.

- Dividing up the work section should just be like one sentence, and our future plans should involve what i already specified here pretty much.