<h3>Tool assisted housing prices</h3>

The goal of this notebook is to work through this project in the typical Kaggle style, with a heavy sprinkle of ChatGPT's help.
The primary objective is to use ChatGPT as a tool.
Some possible questions to answer at the end:

    1. What problems did you have using ChatGPT?
    2. How did you prompt ChatGPT?
    3. Where did you believe ChatGPT helped the most in the process?
    4. How would you rate ChatGPT's effectiveness in its answers?

Below is a list of tasks to accomplish.
I want chatGPT to assist in each of the tasks.
I'll document my prompts (somehow, no idea yet) and answers:

    1. Data setup
    2. EDA
    3. Pipeline
    4. Modeling
    5. Feature engineering
    6. Ensemble

In [1]:
import pandas as pd
from IPython.display import display, HTML # This import comes from a chatGPT response!

In [2]:
df_train = pd.read_csv('data/train.csv')

<h5><b>Data setup and EDA</b></h5>

Ideally we use ChatGPT for each step and see how far it can take us before we have to grab the steering wheel. I'm curious whether I can just paste in the output tables for things like null values and summary stats and get ChatGPT's opinion on what to do next! We'll see if I can, and if I agree with the results.

    Get an idea of the data, summary stats
    What do the nulls look like?
    What is the shape of the data?
    May be worth splitting up a view for numerics and categoricals


In [3]:
df_train.describe()

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


Upon using `df_train.describe()` I am met with a monstrous table display.
We'll use chatgpt to circumvent this.
I'll upload a link to the chat for those interested.

So my first prompt was:
We're doing some analysis with python inside of a jupyter notebook. Can you create a way to view tables in a more concise manner?

    It gave me useful info on installing pandas but didn't quite give me what I wanted.

The second prompt:
I already have pandas, thanks. I would like my tables in the output to be more concise. Can you help me out with that?

    This one was more precise. It showed me how I can change my max_rows, max_columns, max_colwidth, and precision but I don't want to truncate my data.
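
For reference, here is a rough sketch of what those display settings look like when scoped with `pd.option_context`, so they revert automatically instead of changing the whole session the way `pd.set_option` does. `describe_compact` is a hypothetical name and the toy frame is made up. Nothing is truncated here: only the number of printed decimals is reduced.

```python
import pandas as pd


def describe_compact(df: pd.DataFrame, precision: int = 2) -> str:
    """Text rendering of df.describe().T under temporary display options."""
    with pd.option_context(
        "display.max_rows", None,       # no row truncation
        "display.max_columns", None,    # no column truncation
        "display.precision", precision, # fewer decimals, same data
    ):
        # The options only apply while rendering inside this block
        return repr(df.describe().T)


# Toy data standing in for df_train (values are made up)
toy = pd.DataFrame({"LotArea": [8450, 9600, 11250], "SalePrice": [208500, 181500, 223500]})
print(describe_compact(toy))
```

In a notebook, the rich HTML table is rendered at display time, so `display(df.describe().T)` would need to be called inside the `with` block for the options to take effect.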

Third prompt:
I don't want to truncate my data, I would like to still view everything without it taking up a lot of space. Is there a function you can provide me to achieve this?

    This one seems interesting. I'll try it out and see what that looks like in the next block

In [4]:
# Set the display width to fit your Jupyter Notebook's cell
pd.set_option('display.width', None)

# Set the maximum number of columns to display
pd.set_option('display.max_columns', None)

# Set the maximum width for each column
pd.set_option('display.max_colwidth', None) # had to change -1 to None, chatGPT made a mistake here

# Display the DataFrame as a string without truncation
print(df_train.describe().to_string())

                Id   MSSubClass  LotFrontage        LotArea  OverallQual  OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF     1stFlrSF     2ndFlrSF  LowQualFinSF    GrLivArea  BsmtFullBath  BsmtHalfBath     FullBath     HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd   Fireplaces  GarageYrBlt   GarageCars   GarageArea   WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch     PoolArea       MiscVal       MoSold       YrSold      SalePrice
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  1460.000000  1460.000000  1460.000000  1460.000000  1460.000000   1460.000000  1460.000000   1460.000000   1460.000000  1460.000000  1460.000000   1460.000000   1460.000000   1460.000000  1460.000000  1379.000000  1460.000000  1460.000000  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000  1460.000000   1460.000000  1460.000000 

I will say that it does look a lot more concise but it looks pretty ugly. Time to get more precise on my prompting.

4th prompt: This looks a lot better. However, I would like a function that allows me to use this each time and show case the data in scrollable html table.

    Holy smokes, it actually gave a function to use. However, at a glance I can see that I need more arguments for my use cases and it still has a -1 instead of None. Before we use this let's see if chatGPT can fix that mistake after I mention it and allow for more inputs.

5th prompt: Thanks! I noticed that the -1 for pd.set_option results in this error: ValueError: Value must be a nonnegative integer or None.
Can you fix that value error and allow for more inputs such as table name and table ID

    I'm honestly blown away with the response this one gave me. It even goes as far as how to use it and why my extra inputs are useful. Cool. Not sure I like the 1000 in place of -1, I'll stick with None for now. Tool assisted, I still gotta put in work here.

In [5]:
# Let's see this ChatGPT code in action


def display_scrollable_table(df, table_name=None, table_id=None):
    # Set the display width to fit your Jupyter Notebook's cell
    pd.set_option('display.width', None)

    # Set the maximum number of columns to display
    pd.set_option('display.max_columns', None)

    # Disable column-width truncation (None means no limit)
    pd.set_option('display.max_colwidth', None) # NOTE: Like I mentioned I prefer None over 1000, changed

    # Generate HTML table
    html_table = df.to_html(index=True, classes='scrollable-table', table_id=table_id) # NOTE: The index being false is a no go for me, changed

    # Create HTML with styling and scrollable container
    html = f"""
    <div style="overflow-x: auto;">
        <h4>{table_name}</h4>
        {html_table}
    </div>
    """

    # Display the HTML table
    display(HTML(html))

In [6]:
display_scrollable_table(df_train.describe().T, "Summary Stats")

,count,mean,std,min,25%,50%,75%,max
Id,1460.0,730.5,421.610009,1.0,365.75,730.5,1095.25,1460.0
MSSubClass,1460.0,56.89726,42.300571,20.0,20.0,50.0,70.0,190.0
LotFrontage,1201.0,70.049958,24.284752,21.0,59.0,69.0,80.0,313.0
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099315,1.382997,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575342,1.112799,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.865753,20.645407,1950.0,1967.0,1994.0,2004.0,2010.0
MasVnrArea,1452.0,103.685262,181.066207,0.0,0.0,0.0,166.0,1600.0
BsmtFinSF1,1460.0,443.639726,456.098091,0.0,0.0,383.5,712.25,5644.0


Okay, so my last prompt on this function: Good work, can we make the window size smaller and scrollable both horizontally and vertically?

    Chatgpt gives me some extra inputs to customize the size of the output window. It gave me 300px on height by default and 100% on width by default. I'll iterate on the next cell block to find what works best for me

In [14]:
def display_scrollable_table(df, table_name=None, table_id=None, width='75%', height='300px'):
    # Set the display width to fit your Jupyter Notebook's cell
    pd.set_option('display.width', None)

    # Set the maximum number of columns to display
    pd.set_option('display.max_columns', None)

    # Disable column-width truncation (None means no limit)
    pd.set_option('display.max_colwidth', None) # NOTE: again I needed to change this to None but its nice to see that it kept it in the conversation

    # Generate HTML table
    html_table = df.to_html(index=True, classes='scrollable-table', table_id=table_id) # NOTE: same change on index being true here. I didn't mention it in the prompt

    # Create HTML with styling and scrollable container
    html = f"""
    <div style="width: {width}; height: {height}; overflow: auto;">
        <h4>{table_name}</h4>
        {html_table}
    </div>
    """

    # Display the HTML table
    display(HTML(html))


In [16]:
display_scrollable_table(df_train.describe(), "Summary Stats")
display_scrollable_table(df_train.describe().T, "Summary Stats")

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


,count,mean,std,min,25%,50%,75%,max
Id,1460.0,730.5,421.610009,1.0,365.75,730.5,1095.25,1460.0
MSSubClass,1460.0,56.89726,42.300571,20.0,20.0,50.0,70.0,190.0
LotFrontage,1201.0,70.049958,24.284752,21.0,59.0,69.0,80.0,313.0
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099315,1.382997,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575342,1.112799,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.865753,20.645407,1950.0,1967.0,1994.0,2004.0,2010.0
MasVnrArea,1452.0,103.685262,181.066207,0.0,0.0,0.0,166.0,1600.0
BsmtFinSF1,1460.0,443.639726,456.098091,0.0,0.0,383.5,712.25,5644.0


Hey this is great, I set the width to 75% since I don't want to look at the extreme right of my screen. The 300px height works great for me. Good work ChatGPT. I should consider using Copilot as well for all the tool assisted help.
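
One thing the function above does not handle is the default `table_name=None`, which would render a literal "None" heading. The variant below is only a sketch of one way around that, not what ChatGPT returned: `build_scrollable_html` is a hypothetical name, it returns the HTML string instead of displaying it, and it scopes the `max_colwidth` change with `pd.option_context` so the setting does not leak into the rest of the session.

```python
import html

import pandas as pd


def build_scrollable_html(df, table_name=None, table_id=None,
                          width="75%", height="300px"):
    """Return the HTML for a scrollable table (hypothetical variant)."""
    # Temporarily lift the column-width limit while the table is rendered
    with pd.option_context("display.max_colwidth", None):
        html_table = df.to_html(index=True, classes="scrollable-table",
                                table_id=table_id)
    # Only emit a heading when a name was given; escape it in case it has < or &
    heading = f"<h4>{html.escape(table_name)}</h4>" if table_name else ""
    return (
        f'<div style="width: {width}; height: {height}; overflow: auto;">'
        f"{heading}{html_table}</div>"
    )


# Toy data standing in for df_train (values are made up)
toy = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
snippet = build_scrollable_html(toy.describe().T, "Summary <Stats>")
```

In the notebook this would be shown with `display(HTML(build_scrollable_html(df_train.describe().T, "Summary Stats")))`.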

I think I've digressed far enough on this unimportant function but it was neat to see how chatGPT and I could iterate on this. Next I want to feed in the summary stats output after I talk about the purpose of this project.

<h5><b>IDEA: Create a "small" app that sends the same prompt to Bard, Copilot, and ChatGPT and returns their results (possibly any other LLMs as well).</b></h5>

Prompt: To give you more background on this project I want to use the data from kaggle's housing prices and run through a typical data science pipeline using your help! The goal of this dataset is to predict the sales price. Here is a link to those datasets: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

What are some things we should look for in our data exploration analysis? 

    This was the output!

Exploring the data is an essential step in any data science pipeline. It helps you understand the characteristics of the dataset, identify patterns, and gain insights that can guide your subsequent analysis and modeling. Here are some key aspects to consider during your data exploration analysis for the Kaggle House Prices dataset:

1. **Data Overview**: Begin by loading the dataset and obtaining a high-level understanding of its structure. Explore the dimensions (number of rows and columns) of the dataset and review the available features (columns) and their data types.

2. **Target Variable**: In this case, the target variable is the "SalePrice" column. Examine its distribution to understand its range, skewness, and potential outliers. Plotting a histogram or a box plot can provide initial insights into the distribution of the target variable.

3. **Feature Analysis**: Explore each feature (column) in the dataset to understand its nature and relationship with the target variable. Consider the following:

   - **Numerical Features**: Analyze numerical features such as "GrLivArea," "LotArea," "OverallQual," etc. Examine their distributions, identify outliers, and check for any potential correlations with the target variable.
   
   - **Categorical Features**: Investigate categorical features like "MSZoning," "Neighborhood," "Exterior1st," etc. Evaluate their unique categories, frequency distributions, and potential impact on the target variable.
   
   - **Missing Values**: Identify any missing values in the dataset and determine the extent of missingness for each feature. Analyze the potential reasons behind missing values and decide how to handle them in subsequent steps.

4. **Correlations**: Calculate the correlation between the numerical features and the target variable. Identify features that have a strong positive or negative correlation, as they may be good predictors of the target variable. Visualize the correlations using a correlation matrix or a heatmap.

5. **Feature Relationships**: Examine the relationships between different features. Look for potential correlations or dependencies between features that can aid in feature selection or engineering. Scatter plots, pair plots, or other visualizations can be helpful for this analysis.

6. **Outliers**: Identify any outliers in the dataset. Determine whether these outliers are erroneous data points or represent genuine observations. Depending on the nature of the outliers, decide whether to remove or transform them.

7. **Data Preprocessing**: Based on your analysis, determine the appropriate preprocessing steps for the dataset. This may include handling missing values, encoding categorical variables, scaling numerical features, or applying transformations to skewed data.

These are some initial aspects to consider during your data exploration analysis. As you delve deeper into the dataset, additional questions and insights may arise. Remember to document your observations, as they will guide your subsequent data cleaning, feature engineering, and modeling steps.
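
To make points 2 and 4 of that list concrete, here is a small sketch that measures the skewness of the target and ranks numeric features by their correlation with it. `target_report` is a hypothetical helper, and the toy frame (including the `Age` column) is made up purely for illustration.

```python
import pandas as pd


def target_report(df: pd.DataFrame, target: str, top_n: int = 5):
    """Return (skewness of the target, strongest numeric correlations with it)."""
    numeric = df.select_dtypes(include="number")
    skew = numeric[target].skew()
    corr = numeric.corr()[target].drop(target)
    # Rank by absolute value so strong negative correlations are not buried
    ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
    return skew, ranked.head(top_n)


# Toy data standing in for df_train (values are made up)
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 4000],
    "Age": [50, 30, 20, 10, 1],
    "Neighborhood": list("ABCAB"),
    "SalePrice": [100_000, 150_000, 200_000, 250_000, 600_000],
})
skew, top = target_report(toy, "SalePrice")
print(f"skew = {skew:.2f}")
print(top)
```

A positive skew means a long right tail, which is the usual reason people consider a log transform of a price target.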

<h1>Holy shit.</h1>

It was able to list out a general outline of what I need to do first. Seems like I skipped some steps in my overview! Tool assistant already putting in work.
ChatGPT even knows what the target variable is; whether it knew from its own training data or actually read through the link, I don't know. It also gave me some good ideas that I wasn't about to consider due to my laziness, such as skewness, outliers, and even plotting a distribution to see how normal it is. Seems I'll need to pull some memories from my stat classes! This is good though, and I'll see if ChatGPT can help me in each of these steps. This notebook is about to get really long.

At least I got number 3 on point. Split up the objects/numerics and look at null values.
Correlations are definitely a good idea. The outliers part is good too; we'll have to plot some data and get a basic idea of how linear everything is, or isn't. Some of these later steps are what I would consider feature engineering, so I may skip those parts until we get to that point. Anyway, I'll run through the basic steps like shape and describe and send that to ChatGPT. Then we'll ask about bullets 2 and 3.
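
For the outlier check, one common rule of thumb (an assumption here, not something ChatGPT prescribed) is to flag values more than 1.5 × IQR beyond the quartiles. A minimal sketch, with a hypothetical `iqr_outlier_mask` helper and made-up numbers:

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy values (made up); the last one is far from the rest
prices = pd.Series([100, 120, 130, 140, 150, 160, 900], name="SalePrice")
mask = iqr_outlier_mask(prices)
print(prices[mask])
```

Plugging in the SalePrice quartiles from the summary table (129975 and 214000) gives an IQR of 84025 and an upper fence of 340037.5, so the 755000 maximum would be flagged under this rule. Whether a flagged sale is an error or just an expensive house still needs a human look.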

In [17]:
display(df_train.shape)
display_scrollable_table(df_train.describe().T, "Summary Stats")

(1460, 81)

,count,mean,std,min,25%,50%,75%,max
Id,1460.0,730.5,421.610009,1.0,365.75,730.5,1095.25,1460.0
MSSubClass,1460.0,56.89726,42.300571,20.0,20.0,50.0,70.0,190.0
LotFrontage,1201.0,70.049958,24.284752,21.0,59.0,69.0,80.0,313.0
LotArea,1460.0,10516.828082,9981.264932,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099315,1.382997,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575342,1.112799,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.267808,30.202904,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.865753,20.645407,1950.0,1967.0,1994.0,2004.0,2010.0
MasVnrArea,1452.0,103.685262,181.066207,0.0,0.0,0.0,166.0,1600.0
BsmtFinSF1,1460.0,443.639726,456.098091,0.0,0.0,383.5,712.25,5644.0


Prompt: We'll refer back to that in a bit. What insights can you give me based on the shape of the data being (1460, 81) and the summary stats being: [plugged in the output from above]

    I'll just paste what it gave. Pretty cool

Based on the shape of the data being (1460, 81) and the summary statistics you provided, here are some initial insights:

1. **Dataset Size**: The dataset contains 1,460 instances (rows) and 81 features (columns). This indicates a relatively moderate-sized dataset with a considerable number of features to work with.

2. **Target Variable (SalePrice)**:
   - The target variable, "SalePrice," has a mean of approximately 180,921.20 with a standard deviation of 79,442.50. This indicates some variability in the sale prices of the houses.
   - The minimum sale price is 34,900, while the maximum sale price is 755,000.
   - The distribution of the target variable can provide insights into the range and spread of house prices.

3. **Numerical Features**:
   - The dataset contains various numerical features, such as "LotFrontage," "LotArea," "OverallQual," "OverallCond," "YearBuilt," etc.
   - These features exhibit different ranges, distributions, and scales. Analyzing their summary statistics can provide a sense of their distributions and potential outliers.

4. **Missing Values**:
   - Some features, such as "LotFrontage," "MasVnrArea," and "GarageYrBlt," have missing values as indicated by the difference between the count of instances and the expected count (1460).
   - Identifying and handling missing values is crucial to ensure the quality and integrity of the dataset during subsequent analysis.

5. **Categorical Features**:
   - The summary statistics you provided mainly focus on numerical features. However, the dataset likely contains several categorical features as well. It is important to explore these categorical features separately to understand their unique categories, distributions, and potential relationships with the target variable.

6. **Temporal Features**:
   - Features such as "YearBuilt," "YearRemodAdd," "GarageYrBlt," "YrSold," etc., represent temporal information related to the houses.
   - Analyzing these features can provide insights into the age of the houses, remodeling trends, and the years in which the houses were sold.

These are just initial observations based on the dataset shape and summary statistics you provided. To gain a deeper understanding of the data, it is important to visualize the distributions, explore relationships between features, handle missing values, and conduct further exploratory analysis.
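
As a sketch of the temporal point (item 6), the year columns can be turned into ages relative to the sale year. The new column names `HouseAge` and `YearsSinceRemodel` are hypothetical, and the toy rows are made up.

```python
import pandas as pd


def add_age_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with derived age columns (column names are hypothetical)."""
    out = df.copy()
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["YearsSinceRemodel"] = out["YrSold"] - out["YearRemodAdd"]
    return out


# Toy rows standing in for df_train (values are made up)
toy = pd.DataFrame({
    "YearBuilt": [1973, 2000],
    "YearRemodAdd": [1994, 2004],
    "YrSold": [2008, 2009],
})
aged = add_age_features(toy)
print(aged[["HouseAge", "YearsSinceRemodel"]])
```

This really belongs to the feature engineering step, so treat it as a preview rather than part of the EDA.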

In [18]:
# We'll take its advice and split the cats out and rerun the summary stats and send it back
df_train_cats = df_train.select_dtypes(include=[object])
display_scrollable_table(df_train_cats.describe().T, "Summary Stats for Categoricals")

,count,unique,top,freq
MSZoning,1460,5,RL,1151
Street,1460,2,Pave,1454
Alley,91,2,Grvl,50
LotShape,1460,4,Reg,925
LandContour,1460,4,Lvl,1311
Utilities,1460,2,AllPub,1459
LotConfig,1460,5,Inside,1052
LandSlope,1460,3,Gtl,1382
Neighborhood,1460,25,NAmes,225
Condition1,1460,9,Norm,1260


Prompt: Thanks, I took your advice and reran the summary statistics for categorical variables. What insights can you give from this: [previous output]

Based on the summary statistics of the categorical variables, here are some insights:

1. **Variable Cardinality**:
   - The count column indicates the number of non-missing values for each categorical variable.
   - The unique column indicates the number of unique categories within each variable.
   - Analyzing variable cardinality helps identify the diversity and variation within categorical features.

2. **Frequent Categories**:
   - The top column represents the most frequent category within each variable, while freq indicates the frequency of occurrence for that category.
   - For example, in the "MSZoning" variable, the most common category is "RL" with a frequency of 1151.
   - Identifying the most frequent categories can provide insights into the dominant characteristics of the dataset.

3. **Missing Values**:
   - The difference between the count and the expected count (1460) can indicate missing values within categorical variables.
   - For example, the "Alley" variable has 91 non-missing values out of 1460 instances, indicating a significant number of missing values.
   - Handling missing values is crucial to ensure the quality and integrity of the dataset during analysis.

4. **Property and Neighborhood Characteristics**:
   - Categorical variables such as "LotShape," "LandContour," "Utilities," "LotConfig," "LandSlope," etc., provide information about the characteristics of the properties and their surroundings.
   - Variables like "Neighborhood" can offer insights into the different neighborhoods represented in the dataset.

5. **Building and Garage Attributes**:
   - Variables like "BldgType," "HouseStyle," "RoofStyle," "RoofMatl," "Exterior1st," "Exterior2nd," etc., describe the building and architectural attributes of the houses.
   - These variables can provide information on the style, material, and external features of the properties.

6. **Quality and Condition Ratings**:
   - Categorical variables such as "ExterQual," "ExterCond," "BsmtQual," "BsmtCond," "HeatingQC," "KitchenQual," "FireplaceQu," etc., represent quality and condition ratings of various components of the houses.
   - Analyzing these variables can help identify the distribution of ratings and their potential impact on the sale prices.

7. **Sale Details**:
   - Categorical variables like "SaleType" and "SaleCondition" provide information about the type and condition of the sale transactions.
   - Understanding the distribution of sale types and conditions can provide insights into the nature of the sales represented in the dataset.

These insights from the categorical variables can help in understanding the different attributes, characteristics, and distributions within the dataset. Further analysis can involve visualizations, exploring relationships with the target variable, handling missing values, and conducting more in-depth exploratory analysis.
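
Points 1 to 3 above can be boiled down to one table: how many categories a column has, how dominant the top one is, and how many values are missing. The sketch below uses a hypothetical `dominant_share` helper and made-up toy data.

```python
import pandas as pd


def dominant_share(df: pd.DataFrame) -> pd.DataFrame:
    """Per non-numeric column: unique count, top category, its share, missing count."""
    cats = df.select_dtypes(exclude="number")
    rows = []
    for col in cats.columns:
        counts = cats[col].value_counts()  # NaN is excluded by default
        if counts.empty:
            continue  # column is entirely missing, nothing to rank
        rows.append({
            "column": col,
            "unique": len(counts),
            "top": counts.index[0],
            # Share of the most frequent category among non-missing values
            "top_share": counts.iloc[0] / counts.sum(),
            "missing": int(cats[col].isna().sum()),
        })
    report = pd.DataFrame(rows).set_index("column")
    return report.sort_values("top_share", ascending=False)


# Toy data standing in for df_train (values are made up)
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSeWa"],
    "Alley": ["Grvl", "Pave", "Grvl"] + [None] * 7,
    "LotArea": range(10),
})
report = dominant_share(toy)
print(report)
```

In the table above, Utilities has AllPub in 1459 of 1460 rows (about 99.9%) and Street has Pave in 1454 of 1460 (about 99.6%), so columns like these carry very little signal on their own.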

    Cool now I want to go back and hit bullet point 2 on the target variable. I'd like to start visualizing some of the things chatGPT mentioned

Prompt: Awesome. Going back to your second bullet point from earlier, Target Variable, can you write me some code to visualize everything you mentioned?

    It gave me some code to visualize stuff like sales price. I'll run it in the next block after I change a few things such as the df names

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting histogram of SalePrice
plt.figure(figsize=(8, 6))
sns.histplot(data=df_train, x='SalePrice', kde=True)
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Count')
plt.show()

# Plotting boxplot of SalePrice by Neighborhood
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_train, x='Neighborhood', y='SalePrice')
plt.title('Sale Price Distribution by Neighborhood')
plt.xlabel('Neighborhood')
plt.ylabel('Sale Price')
plt.xticks(rotation=45)
plt.show()

# Plotting violinplot of SalePrice by OverallQual
plt.figure(figsize=(8, 6))
sns.violinplot(data=df_train, x='OverallQual', y='SalePrice')
plt.title('Sale Price Distribution by Overall Quality')
plt.xlabel('Overall Quality')
plt.ylabel('Sale Price')
plt.show()

# Plotting countplot of MSZoning
plt.figure(figsize=(8, 6))
sns.countplot(data=df_train, x='MSZoning')
plt.title('Count of Properties by Zoning Classification')
plt.xlabel('Zoning Classification')
plt.ylabel('Count')
plt.show()
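
With 25 neighborhoods, the boxplot above is easier to read when the boxes are sorted by median price. This is an optional tweak, not part of the ChatGPT answer. `order_by_median` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def order_by_median(df: pd.DataFrame, by: str, value: str) -> list:
    """Category labels of `by`, sorted by the median of `value` (ascending)."""
    return df.groupby(by)[value].median().sort_values().index.tolist()


# Toy rows standing in for df_train (values are made up)
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "StoneBr", "StoneBr", "OldTown"],
    "SalePrice": [140_000, 150_000, 400_000, 500_000, 110_000],
})
order = order_by_median(toy, "Neighborhood", "SalePrice")
print(order)
```

The resulting list can be passed to seaborn through the `order` argument, for example `sns.boxplot(data=df_train, x='Neighborhood', y='SalePrice', order=order_by_median(df_train, 'Neighborhood', 'SalePrice'))`.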
