# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [3]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [4]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler


## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [5]:
# Specify the file path of the data
data_file_path = "data/airbnbListingsData.csv"

# Load the data into a DataFrame called df
df = pd.read_csv(data_file_path)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
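
A few of the techniques listed above can be sketched on a small made-up DataFrame. The column names and values here are hypothetical, and clipping at the 1st and 99th percentiles is only one common way to winsorize; adapt these choices to your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value and an
# outlier, and one categorical column.
toy = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 3.0, 40.0],
    "room_type": ["Private room", "Entire home", "Entire home",
                  "Shared room", "Private room"],
})

# Address missingness: replace missing values with the column mean.
toy["beds"] = toy["beds"].fillna(toy["beds"].mean())

# Winsorize: clip values to the 1st and 99th percentiles.
lower, upper = toy["beds"].quantile([0.01, 0.99])
toy["beds_winsorized"] = toy["beds"].clip(lower=lower, upper=upper)

# One-hot encode the categorical feature.
toy_encoded = pd.get_dummies(toy, columns=["room_type"])

print(toy_encoded.head())
```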


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
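
As a rough illustration of these inspection steps, the sketch below uses a hypothetical toy DataFrame (the `bedrooms`, `price`, and `label` columns are made up for the example) to show `describe()`, `dtypes`, a boolean filter, and a simple class-balance check.

```python
import pandas as pd

# Hypothetical toy data with a binary label derived from price.
toy = pd.DataFrame({
    "bedrooms": [1, 2, 1, 3, 2, 1],
    "price": [120.0, 250.0, 90.0, 800.0, 300.0, 150.0],
})
toy["label"] = (toy["price"] > 500).astype(int)

# Key statistics and data types for each column.
print(toy.describe())
print(toy.dtypes)

# Filter rows with a boolean condition.
expensive = toy[toy["price"] > 500]

# Class balance: the share of examples in each class.
class_share = toy["label"].value_counts(normalize=True)
print(class_share)
```

A class share far from an even split suggests class imbalance, which is worth keeping in mind when choosing an evaluation metric.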


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [None]:
# GOAL: Predict whether an Airbnb listing's price is over or under $500
# based on its physical characteristics, using the columns
# 'bathrooms', 'bedrooms', 'beds', and 'price'.

# Exploratory Data Analysis
# Display the first few rows of the data
print(df.head())

# Select the columns of interest
selected_columns = ['bathrooms', 'bedrooms', 'beds', 'price']
feature_columns = selected_columns[:-1]

# Get basic statistics and data types of the selected columns
print(df[selected_columns].describe())
print(df[selected_columns].dtypes)

# Check for missing values
missing_values = df[selected_columns].isnull().sum()
print("Missing values:\n", missing_values)

# Handle missing values by replacing them with the column mean (numerical columns)
df[selected_columns] = df[selected_columns].fillna(df[selected_columns].mean())

# Build a plotting copy with the binary label: True if price is over $500
plot_df = df[selected_columns].assign(over_500=df['price'] > 500)

# Visualize relationships between the features and the label
sns.pairplot(plot_df, vars=selected_columns, hue='over_500', diag_kind='hist')
plt.show()

# Check for class imbalance: count the listings over and under $500
class_counts = plot_df['over_500'].value_counts()
print("Class counts:\n", class_counts)

# Visualize the class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='over_500', data=plot_df)
plt.title("Class Distribution")
plt.show()

# Preview the standardized features without overwriting df
# (the features are scaled for modeling in Part 3)
scaler = StandardScaler()
scaled_preview = pd.DataFrame(scaler.fit_transform(df[feature_columns]), columns=feature_columns)
print(scaled_preview.head())

                                                name  \
0                              Skylit Midtown Castle   
1  Whole flr w/private bdrm, bath & kitchen(pls r...   
2           Spacious Brooklyn Duplex, Patio + Garden   
3                   Large Furnished Room Near B'way　   
4                 Cozy Clean Guest Room - Family Apt   

                                         description  \
0  Beautiful, spacious skylit studio in the heart...   
1  Enjoy 500 s.f. top floor in 1899 brownstone, w...   
2  We welcome you to stay in our lovely 2 br dupl...   
3  Please don’t expect the luxury here just a bas...   
4  Our best guests are seeking a safe, clean, spa...   

                               neighborhood_overview    host_name  \
0  Centrally located in the heart of Manhattan ju...     Jennifer   
1  Just the right mix of urban center and local n...  LisaRoxanne   
2                                                NaN      Rebecca   
3    Theater district, many restaurants around her

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.
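
One way to prepare data for a classifier is sketched below on synthetic data. The names `X_demo`, `y_demo`, and `demo_scaler` are hypothetical, and the stratified split with a scaler fit only on the training portion is a common pattern rather than a requirement of this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 200 examples, 3 features, 10% positives.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 20 + [0] * 180)

# stratify keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to both splits,
# so that no information from the test set leaks into training.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Positive share (train):", y_tr.mean())
print("Positive share (test):", y_te.mean())
```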

In [None]:
# Define the columns to use as features
selected_columns = ['bathrooms', 'bedrooms', 'beds']

# Keep only the features and the price, and drop rows with missing values
df_selected = df[selected_columns + ['price']].dropna()

# Define X and y
X = df_selected[selected_columns]  # Features: bathrooms, bedrooms, beds
y = (df_selected['price'] > 500).astype(int)  # Label: 1 if price is over $500, 0 otherwise

# Standardize numerical features
# (decision trees are insensitive to feature scaling, so this step is optional for this model)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Display basic information about the dataset
print(df.info())

# Summary statistics for the selected columns
print(df[selected_columns].describe())

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=123)
# Choose hyperparameters (minimum samples per leaf and maximum depth)
min_samples_leaf = 5
max_depth = 10

In [None]:
# Define the hyperparameters to search
param_grid = {
    'min_samples_leaf': [1, 3, 5, 10],
    'max_depth': [None, 10, 20, 30]
}

# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=123)

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Train a Decision Tree classifier with the best hyperparameters
# Create Decision Tree classifier
best_dt_classifier = DecisionTreeClassifier(min_samples_leaf=best_params['min_samples_leaf'], max_depth=best_params['max_depth'], random_state=123)

# Fit the model to the training data
best_dt_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = best_dt_classifier.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
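
Accuracy alone can look high when one class dominates the data. The sketch below uses hypothetical labels, not results from this notebook, to show how `confusion_matrix` and `classification_report` can expose a classifier that only ever predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: 95 negatives and 5 positives, with a classifier
# that always predicts the majority class (0).
y_true_demo = np.array([0] * 95 + [1] * 5)
y_pred_demo = np.zeros(100, dtype=int)

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy:", demo_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# zero_division=0 avoids warnings for a class that is never predicted.
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

The same two calls can be applied to `y_test` and `y_pred` to check per-class precision and recall for the tuned model.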

In [None]:

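# Baseline: train a Decision Tree with the manually chosen hyperparameters
baseline_dt = DecisionTreeClassifier(min_samples_leaf=min_samples_leaf, max_depth=max_depth, random_state=123)
baseline_dt.fit(X_train, y_train)

# Evaluate the baseline on the test data
baseline_accuracy = accuracy_score(y_test, baseline_dt.predict(X_test))

# Print the accuracy of the baseline Decision Tree classifier
print("Baseline accuracy:", baseline_accuracy)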

In [None]:
# Example: assume best_dt_classifier has been trained and there is a new
# Airbnb listing with 2 bathrooms, 3 bedrooms, and 2 beds
new_listing = [[2, 3, 2]]

# Convert the new listing to a DataFrame with the same feature columns
new_listing_df = pd.DataFrame(new_listing, columns=selected_columns)

# Standardize the features using the same scaler that was fit for the model
new_listing_scaled = scaler.transform(new_listing_df[selected_columns])

# Use the trained Decision Tree classifier to predict the class label
predicted_class = best_dt_classifier.predict(new_listing_scaled)

# Convert the predicted class label to "Over $500" or "Under $500"
predicted_price_range = "Over $500" if predicted_class[0] == 1 else "Under $500"

# Print the prediction
print("Predicted price range for the new Airbnb listing:", predicted_price_range)