# Lab 8: Define and Solve an ML Problem of Your Choosing

In [6]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns




In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [7]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename)

df['name'].iloc[0], df['host_identity_verified'].mode().iloc[0],df['neighbourhood_group_cleansed'].mode().iloc[0]

('Skylit Midtown Castle', np.True_, 'Manhattan')

In [8]:
df_original=df.copy()

In [9]:
df.columns

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

In [10]:
df['total_rooms']=df['bathrooms'] +df['bedrooms']

In [11]:
df['total_rooms']

0        NaN
1        2.0
2        3.5
3        2.0
4        2.0
        ... 
28017    2.0
28018    3.0
28019    3.0
28020    2.0
28021    2.0
Name: total_rooms, Length: 28022, dtype: float64
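
The `NaN` in row 0 appears because adding two columns propagates a missing value from either one. The sketch below reproduces this on a hypothetical three-row frame (the values are made up, not taken from the data set) and shows `Series.add` with `fill_value=0` as one possible alternative that treats a single missing count as zero. Whether that treatment suits this data is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the Airbnb data
toy = pd.DataFrame({
    "bathrooms": [np.nan, 1.0, 1.5],
    "bedrooms": [1.0, 1.0, 2.0],
})

# Plain addition: a NaN in either column makes the sum NaN
strict_total = toy["bathrooms"] + toy["bedrooms"]

# Alternative: count a missing value as 0 when the other column is present
lenient_total = toy["bathrooms"].add(toy["bedrooms"], fill_value=0)

print(strict_total.tolist())
print(lenient_total.tolist())
```

With the plain `+`, the missing values are carried into `total_rooms` and handled later by imputation, which is what the notebook does.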

In [12]:
df=df.drop(columns=['bathrooms','bedrooms'])

In [13]:
df.columns

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'beds', 'amenities', 'price', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
    

In [14]:
my_features=['name','host_name','host_location','host_is_superhost','host_response_rate','host_identity_verified','total_rooms','price','has_availability','accommodates','amenities','review_scores_rating','review_scores_cleanliness', 'review_scores_checkin','review_scores_communication', 'review_scores_location','review_scores_value','number_of_reviews_l30d','calculated_host_listings_count']

I chose these columns because location, total rooms, and accommodates are suitable features for the clustering portion of this system. The remaining columns will be used as features, except `review_scores_value` (the Sentiment) and `review_scores_rating`, which are the labels.

In [15]:
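# Preview only: rename() returns a new DataFrame, so df itself keeps the review_scores_value column
df.rename(columns={"review_scores_value": "Sentiment"})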

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_location,Sentiment,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,total_rooms
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.80,0.17,True,8.0,...,4.86,4.41,False,3,3,0,0,0.33,9,
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.71,4.64,False,1,1,0,0,4.86,6,2.0
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.00,0.25,True,1.0,...,4.50,5.00,False,1,1,0,0,0.02,3,3.5
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.00,1.00,True,1.0,...,4.87,4.36,False,1,0,1,0,3.68,4,2.0
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.94,4.92,False,1,0,1,0,0.87,7,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,Astoria Luxury suite 2A,THIS LOVELY HOME IS THE SPACIOUS SUITE WITH PR...,,Vicky,"Queens, New York, United States",,1.00,1.00,True,8.0,...,3.00,1.00,True,8,0,8,0,1.00,2,2.0
28018,Newly renovated suite in the heart of Williams...,Just fully renovated from head to toe. On the ...,,Samuel,"New York, New York, United States","Hello, my name is Sam. I am a real estate prof...",0.91,0.89,True,0.0,...,5.00,5.00,False,1,1,0,0,2.00,5,3.0
28019,Perfect Room to Stay in Brooklyn! Near Metro!,"Amazing and comfortable space in Brooklyn, sam...",,Carlos,US,,0.99,0.99,True,6.0,...,5.00,2.00,True,7,0,7,0,1.00,2,3.0
28020,New Beautiful Modern One Bedroom in Brooklyn,This stylish place to stay is perfect for a gr...,,Lexia,"New York, New York, United States","I am a graphic designer, swell chaser and duri...",0.90,1.00,True,3.0,...,5.00,5.00,False,3,3,0,0,1.00,7,2.0


In [16]:
null_counts = df.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
columns_with_nulls #list of columns with null values in them

['name',
 'description',
 'neighborhood_overview',
 'host_location',
 'host_about',
 'host_response_rate',
 'host_acceptance_rate',
 'beds',
 'total_rooms']

In [17]:
# Scrapped idea: remove any row with a null value in any of the columns above
#df_clean = df.dropna(subset=columns_with_nulls)
#df_clean.shape

In [18]:
df.shape

(28022, 49)

Now that we know which columns contain null values, we check each one's data type. Looping over the columns in the list above, if a column is numeric (int or float) we replace its nulls with the column mean. If it holds strings, we replace its nulls with the column's minimum value, that is, the string that comes first in sort order.
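
The rule described above can be sketched on a small hypothetical frame (the column values are invented for illustration). It also computes the most frequent string with `mode()` purely for comparison, since the minimum in sort order and the most common value can differ. Using the mode is an alternative, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, 2.0],
    "host_location": ["Queens", None, "Brooklyn", "Queens"],
})

# For comparison only: the most frequent string in the column
most_common = toy["host_location"].mode().iloc[0]

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric column: fill nulls with the column mean
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # String column: fill nulls with the smallest string in sort order
        toy[col] = toy[col].fillna(toy[col].dropna().min())

print(toy)
print("Most common location:", most_common)
```

Assigning the result back with `toy[col] = ...` avoids calling `fillna(..., inplace=True)` on a column selection.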


In [19]:
for col in columns_with_nulls:
    if df[col].dtype in [np.int64, np.float64]:
        # Numeric columns: replace nulls with the column mean
        mean_value = df[col].mean()
        # Assign the result back; a chained inplace fillna will stop working in pandas 3.0
        df[col] = df[col].fillna(mean_value)
    elif df[col].dtype == object:
        # String columns: replace nulls with the minimum (first in sort order) value
        min_value = df[col].dropna().min()
        df[col] = df[col].fillna(min_value)


In [20]:
df=df[my_features]

In [21]:
df['host_location']

0        New York, New York, United States
1        New York, New York, United States
2        Brooklyn, New York, United States
3        New York, New York, United States
4        New York, New York, United States
                       ...                
28017      Queens, New York, United States
28018    New York, New York, United States
28019                                   US
28020    New York, New York, United States
28021                                   US
Name: host_location, Length: 28022, dtype: object

In [22]:
# Another scrapped idea: pass the reference dict to an OpenAI model and have it format the user inputs.
#reference_dict = df.head(1).to_dict(orient='list') #i will push this into the llm to format input into this
#print(type(reference_dict)) #in programming this is kinda like using jsonify
#print(reference_dict)

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. The data set I will be working with is the Airbnb listings data set.
2. I will be predicting the best Airbnb listing for a user based on the inputs they provide. In particular, we will use location and the highest average reviews over the last 30 days, along with a text input asking what the user wants from their listing. Together these inputs form a preference tuple. The labels for the supervised learning portion are user sentiment and review ratings.
3. I will use unsupervised learning to group the listings whose tuples are closest to the user's tuple, clustering listings by location and total number of rooms. Once clustered, we will refine these groups with a supervised model to predict user preferences more accurately. The approach minimizes the Euclidean distance between each listing's features and the user's preference tuple, helping us find the best overall fit.
4. The features I've chosen are:
['name','host_name','host_location','host_is_superhost','host_response_rate','host_identity_verified','total_rooms','price','has_availability','accommodates','amenities','review_scores_rating','review_scores_cleanliness', 'review_scores_checkin','review_scores_communication', 'review_scores_location','review_scores_value','number_of_reviews_l30d','calculated_host_listings_count']
5. I think an easy-to-use recommendation system like this is very helpful: it saves users from scouring hundreds of listings to find a good one, and it lowers the chance that a listing turns out worse than described. While avid Airbnb users may be able to find their best option with listing filters, having a recommendation system saves both the users and the company time.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    
    As shown earlier, I filtered the data down to the relevant columns. I then handled null values by replacing them with the mean for numeric columns and the minimum value for string columns. I also renamed a feature, `review_scores_value`, to `Sentiment`. This will be the label for the supervised learning portion. We'll use a standard scaler so that each component of our tuples (vectors, later on) carries even weight.
2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?

I'm going to use clustering (K-Means) for the unsupervised portion, keeping only the relevant tuples. On those tuples we will use one-hot encoding for the categorical data, then use a regression model (Random Forest) to rank the best-fit options.
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    
    I'm not sure yet how to evaluate the clustering portion, but for the Random Forest regressor I will use mean squared error and R-squared.
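
For the clustering portion, one commonly used option is the silhouette score, which needs no labels. The following is a minimal sketch on synthetic blobs rather than the Airbnb data. The blob centres and the candidate range of `k` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Silhouette score for several candidate cluster counts
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best-separated clustering among the candidates
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

A score near 1 suggests tight, well-separated clusters, while a score near 0 suggests overlapping ones. The same loop could be pointed at the listing features to sanity-check the choice of `n_clusters`.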
   
Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
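
The inspection techniques listed above can be sketched on a small hypothetical frame. The two columns borrow names from the listings data, but the values are invented, and the 1.5 × IQR rule is just one common outlier convention.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Airbnb file
toy = pd.DataFrame({
    "price": [50.0, 60.0, 70.0, 80.0, 1000.0],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Private room", "Entire home/apt"],
})

print(toy.dtypes)          # data type of each column
print(toy.describe())      # key statistics for numeric columns
print(toy.isnull().sum())  # missing values per column

# Flag outliers with the 1.5 * IQR rule
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy["price"] < lower) | (toy["price"] > upper)]
print(outliers)

# Category counts, useful for spotting imbalance in a categorical column
class_counts = toy["room_type"].value_counts()
print(class_counts)
```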

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [23]:
# YOUR CODE HERE

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

<Double click this Markdown cell to make it editable, and record your answers here.>

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [27]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [28]:
print("Hello and welcome to the AirBnB Recommendation System!")
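# Convert the numeric answers to float right away so they can be used in distance calculations
user_total_rooms=float(input("How many rooms are you looking for in a listing?"))
user_location=input("Where are you planning on staying?").lower()
user_price=float(input("How much do you intend to spend per night?"))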

Hello and welcome to the AirBnB Recommendation System!


How many rooms are you looking for in a listing? 3
Where are you planning on staying? nyc
How much do you intend to spend per night? 70


In [29]:
df['price']

0         150.0
1          75.0
2         275.0
3          68.0
4          75.0
          ...  
28017      89.0
28018    1000.0
28019      64.0
28020      84.0
28021      70.0
Name: price, Length: 28022, dtype: float64

In [30]:
# Making a dictionary to store common inputs and convert into a correct output
location_mapping = {
    'new york, new york, united states': 'New York, New York, United States',
    'nyc': 'New York, New York, United States',
    'new york city': 'New York, New York, United States',
    'los angeles': 'Los Angeles, California, United States',
    'los angeles, california, united states': 'Los Angeles, California, United States',
    'san fran': 'San Francisco, California, United States',
    'sf': 'San Francisco, California, United States',
    'san francisco, california, united states': 'San Francisco, California, United States',
    'brooklyn, new york, united states': 'Brooklyn, New York, United States',
    'berkeley, california, united states': 'Berkeley, California, United States'
}

In [31]:
#convert the input into standardized form
def standardize_location(input_location, mapping):
    input_location = input_location.lower().strip()  # Convert to lowercase and strip whitespace
    return mapping.get(input_location, input_location)  # Map to standard name, or fall back to the lowercased input
user_location_standardized = standardize_location(user_location, location_mapping)
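
A quick check of how this helper behaves, using a one-entry hypothetical mapping (`toy_mapping` is made up for this sketch). Inputs found in the mapping come back in the standard form, while anything else comes back lowercased and stripped, which is why unmapped locations appear in lowercase later in the notebook.

```python
def standardize_location(input_location, mapping):
    # Lowercase and strip whitespace before looking up the standard name
    cleaned = input_location.lower().strip()
    # Fall back to the cleaned (lowercased) input when there is no mapping entry
    return mapping.get(cleaned, cleaned)

# Hypothetical one-entry mapping for illustration
toy_mapping = {"nyc": "New York, New York, United States"}

mapped = standardize_location("  NYC ", toy_mapping)
unmapped = standardize_location("The Bronx, New York, United States", toy_mapping)

print(mapped)
print(unmapped)
```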

In [32]:
# Standardize each location in the array so all cols have the same format
df['host_location'] = df['host_location'].apply(standardize_location, args=(location_mapping,))

In [33]:
#Part 1 Encoding 
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
location_encoded = encoder.fit_transform(df[['host_location']])

# Create a DataFrame for the encoded features
location_df = pd.DataFrame(location_encoded, columns=encoder.get_feature_names_out(['host_location']))

# Concatenate encoded location with the 'total_rooms' and 'price' columns
features = pd.concat([location_df, df[['total_rooms', 'price']]], axis=1)
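
The feature matrix built here mixes 0/1 indicator columns with raw `price` values, so the largest-valued column can dominate Euclidean distances. As a sketch of the scaling step mentioned in the plan, the example below standardizes the numeric columns inside a `ColumnTransformer` before clustering and then ranks listings in the cluster nearest to a user query. The toy listings, the query, and `n_clusters=2` are all assumptions for illustration, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy listings
listings = pd.DataFrame({
    "host_location": ["Brooklyn", "Brooklyn", "Queens", "Queens", "Manhattan", "Manhattan"],
    "total_rooms": [2.0, 2.5, 3.0, 3.5, 2.0, 4.0],
    "price": [70.0, 85.0, 60.0, 75.0, 250.0, 400.0],
})

# One-hot encode the location and standardize the numeric columns;
# sparse_threshold=0.0 makes the transformer return a dense array
preprocess = ColumnTransformer(
    [
        ("location", OneHotEncoder(handle_unknown="ignore"), ["host_location"]),
        ("numeric", StandardScaler(), ["total_rooms", "price"]),
    ],
    sparse_threshold=0.0,
)
X = preprocess.fit_transform(listings)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
listings["cluster"] = kmeans.fit_predict(X)

# Transform a hypothetical user query with the same fitted preprocessing
user = pd.DataFrame({"host_location": ["Queens"], "total_rooms": [3.0], "price": [70.0]})
user_vec = preprocess.transform(user)

# Find the nearest cluster centre, then rank that cluster's listings by distance
closest_idx, _ = pairwise_distances_argmin_min(user_vec, kmeans.cluster_centers_)
same_cluster = listings[listings["cluster"] == closest_idx[0]].copy()
rows = same_cluster.index.to_numpy()
same_cluster["distance"] = np.linalg.norm(X[rows] - user_vec, axis=1)
ranked = same_cluster.sort_values("distance")

print(ranked[["host_location", "total_rooms", "price", "distance"]])
```

Because the query goes through the same fitted transformer as the listings, its values are on the same scale as the cluster centres.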

In [34]:
# Part 2 K means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(features)

# Encode user input location for comparison
user_location_encoded = encoder.transform([[user_location_standardized]])
user_feature_vector = np.concatenate([user_location_encoded, [[user_total_rooms, user_price]]], axis=1)

# Ensure both the user feature vector and the features DataFrame are numerical
user_feature_vector = user_feature_vector.astype(float)
features = features.astype(float)

# Find the closest cluster center to the user's feature vector
closest_cluster_idx, _ = pairwise_distances_argmin_min(user_feature_vector, kmeans.cluster_centers_)
closest_cluster = closest_cluster_idx[0]

# Get listings in the closest cluster (copy so a new column can be added without a warning)
closest_listings = df[df['cluster'] == closest_cluster].copy()

# Calculate distances within the cluster and sort to find closest listings
closest_listings_features = features.loc[closest_listings.index]
distances = np.linalg.norm(closest_listings_features.values - user_feature_vector, axis=1)
closest_listings['distance'] = distances

# Sort by distance to find the most similar listings
closest_listings = closest_listings.sort_values(by='distance')

# Report how many listings fall in the closest cluster
print(closest_listings.shape)



(20683, 21)


In [35]:
#Part 3 Train Test Split

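# Combine multiple metrics into one overall "sentiment" score
# (only number_of_reviews_l30d is divided by 100)
df['Sentiment'] = df['review_scores_value'] + df['review_scores_rating'] + df['number_of_reviews_l30d'] / 100

# Prepare feature columns (the Sentiment column is excluded, but review_scores_rating,
# one of its components, is still included as a feature)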
feature_cols = ['host_is_superhost', 'host_response_rate', 'host_identity_verified', 'total_rooms',
                'price', 'has_availability', 'accommodates', 'review_scores_rating',
                'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
                'review_scores_location', 'calculated_host_listings_count']

# Concatenate encoded location with other features
features = pd.concat([location_df, df[feature_cols]], axis=1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, df['Sentiment'], test_size=0.2, random_state=42)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
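
Because `review_scores_rating` is both a feature and a term in `Sentiment`, the model can partly read the target off its inputs. The sketch below illustrates that effect on synthetic data only. The variables, their distributions, and the helper `holdout_r2` are hypothetical, and the numbers it prints say nothing about the actual listings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumed distributions, not the Airbnb data)
rng = np.random.default_rng(0)
n = 500
price = rng.uniform(50, 300, n)    # unrelated to the target
rating = rng.uniform(1, 5, n)      # a component of the target
value = rng.normal(4.5, 0.3, n)    # the other component, never used as a feature
target = rating + value

def holdout_r2(X):
    """Fit a random forest on 80% of the rows and score R^2 on the other 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, target, test_size=0.2, random_state=42
    )
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Same model, with and without the feature that is also part of the target
r2_with_rating = holdout_r2(np.column_stack([price, rating]))
r2_without_rating = holdout_r2(price.reshape(-1, 1))

print("R^2 with the target component as a feature:", r2_with_rating)
print("R^2 without it:", r2_without_rating)
```

A large gap between the two scores suggests that the high score comes mostly from the overlapping column rather than from the other features.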

In [36]:
df.columns

Index(['name', 'host_name', 'host_location', 'host_is_superhost',
       'host_response_rate', 'host_identity_verified', 'total_rooms', 'price',
       'has_availability', 'accommodates', 'amenities', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'number_of_reviews_l30d',
       'calculated_host_listings_count', 'cluster', 'Sentiment'],
      dtype='object')

In [37]:
#Part 4 Random Forest Regression

# Predict sentiment scores for every listing in df (training and test rows alike)
df['predicted_sentiment'] = rf_model.predict(features)

# Sort by predicted sentiment to find the most relevant listings
ranked_listings = df.sort_values(by='predicted_sentiment', ascending=False)

# Filter for top 3 listings
top_3_listings = ranked_listings.head(3)

# Display the top 3 ranked listings
print("Top 3 Ranked Listings by Predicted Sentiment, Check These Out!")
for index, row in top_3_listings.iterrows():
    print(f"Name: {row['name']}, Location: {row['host_location']}, Total Rooms: {row['total_rooms']}, Price: {row['price']}, Predicted Sentiment: {row['predicted_sentiment']}")
# I think I might've messed up somewhere: this ranking covers every listing in df, not just the user's closest cluster

Top 3 Ranked Listings by Predicted Sentiment, Check These Out!
Name: Private bedroom at "La Casa Blanca" near NYC R#2, Location: the bronx, new york, united states, Total Rooms: 2.0, Price: 70.0, Predicted Sentiment: 10.045899999999996
Name: Private bedroom at "La Casa Blanca" near NYC r#5, Location: the bronx, new york, united states, Total Rooms: 2.0, Price: 70.0, Predicted Sentiment: 10.045899999999996
Name: Cozy private room with full size Bed in Upper West, Location: New York, New York, United States, Total Rooms: 2.0, Price: 43.0, Predicted Sentiment: 10.042133333333316


In [39]:
#r2 score evaluation
from sklearn.metrics import r2_score
y_true = y_test
y_pred = rf_model.predict(X_test)
r2 = r2_score(y_true, y_pred)
r2

0.9057757512952254
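
The plan also called for mean squared error alongside R-squared. A minimal sketch of computing both with `sklearn.metrics` follows. The four values in `y_true_demo` and `y_pred_demo` are hypothetical, chosen so the arithmetic is easy to check by hand. In the notebook the same calls would take `y_test` and `rf_model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted scores
y_true_demo = np.array([9.0, 8.0, 10.0, 7.0])
y_pred_demo = np.array([8.5, 8.0, 9.0, 7.5])

mse = mean_squared_error(y_true_demo, y_pred_demo)  # mean of squared errors
rmse = np.sqrt(mse)                                 # same units as the target
r2 = r2_score(y_true_demo, y_pred_demo)             # share of variance explained

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R^2:  {r2:.3f}")
```

Here the squared errors are 0.25, 0, 1 and 0.25, so the MSE is 0.375, and R-squared is 1 - 1.5/5.0 = 0.7.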