# Lab 8: Define and Solve an ML Problem of Your Choosing

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [3]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename, header=0)

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

#### Model Objective 

Using the Airbnb data set, I will train a regression model on word embeddings of each host's 'About Me' description to predict the host's acceptance rate.

#### Define the Label

I will work with the Airbnb data set, which includes information about the location of the listing, the price, the number of bedrooms and bathrooms, the amenities offered, and the reviews of the listing. It also contains the host acceptance rate (`host_acceptance_rate`), which is the label. Because the label is a continuous value, this is a supervised learning problem, specifically a regression problem.

#### Identify Features

Each host acceptance rate is paired with that host's 'About' description (`host_about`). The features will be a vectorization of the description, capturing information about the words it contains and its length.

#### Problem Overview

By using this model, consumers, especially those from marginalized backgrounds, can seek out hosts who are more likely to accept them. Consequently, potential misunderstandings and conflicts can be prevented, leading to smoother interactions and fewer complaints.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

1. To account for missing data, I will find and remove the examples that are missing a value for the label. Since this is an NLP problem, I will also transform the text into numerical features using word embeddings.
2. To predict the host acceptance rate, which is a continuous value, I will use a linear regression model.
3. To evaluate my model, I will use Mean Squared Error (MSE) and R-squared (the coefficient of determination).
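Before dropping rows, it can help to check how much data is actually missing and what the label looks like. The following is a minimal sketch of that inspection. It uses a small toy DataFrame (`toy_df`, with made-up values) as a stand-in for the Airbnb data, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb DataFrame (hypothetical values, not real data)
toy_df = pd.DataFrame({
    "host_about": ["I love hosting", None, "Artist and designer", "New Yorker", None],
    "host_acceptance_rate": [0.17, 0.69, np.nan, 1.0, 0.25],
})

# Count missing values per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# Summary statistics for the label (describe() ignores NaN values)
label_summary = toy_df["host_acceptance_rate"].describe()
print(label_summary)

# Rows that remain after dropping examples with a missing label
labeled = toy_df.dropna(subset=["host_acceptance_rate"])
print(len(labeled))
```

The same three calls can be run on `df` directly. Note that dropping on the label alone leaves rows whose `host_about` is missing, which matters later because those rows carry no text to learn from.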

In [4]:
# Remove missing data
df.dropna(subset=['host_acceptance_rate'], inplace=True)

# Set label and feature
y = df['host_acceptance_rate']
X = df['host_about']

In [5]:
# Preprocess words to prepare for word embeddings 
import gensim

def preprocess_text(text):
    if isinstance(text, float) and np.isnan(text):
        return []  # or return some placeholder like ['<NA>']
    return gensim.utils.simple_preprocess(str(text))

X = X.apply(preprocess_text)
X

0        [new, yorker, since, my, passion, is, creating...
1        [laid, back, native, new, yorker, formerly, bi...
2        [rebecca, is, an, artist, designer, and, henoc...
3        [used, to, work, for, financial, industry, but...
5        [hello, will, be, welcoming, and, helpful, whi...
                               ...                        
28016                                                   []
28017                                                   []
28018    [hello, my, name, is, sam, am, real, estate, p...
28019                                                   []
28020    [am, graphic, designer, swell, chaser, and, du...
Name: host_about, Length: 16909, dtype: object
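The output above shows that some descriptions become empty token lists. A quick check of how common that is can be sketched as follows, using a toy Series of token lists (`toy_tokens`, made-up values) in place of `X`.

```python
import pandas as pd

# Toy stand-in for the preprocessed descriptions (hypothetical values)
toy_tokens = pd.Series([["new", "yorker"], [], ["hello", "my", "name"], []])

# Number of tokens in each description
token_counts = toy_tokens.apply(len)
print(token_counts.describe())

# Share of examples that have no tokens at all
empty_share = (token_counts == 0).mean()
print(empty_share)
```

If a large share of examples is empty, those rows will all map to the same feature vector later on, which limits how much any model can learn from them.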

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

The label is the host acceptance rate (`host_acceptance_rate`).
The feature is the host's 'About Me' description (`host_about`).

To train my model, I will split the preprocessed text data into training and test sets. Then, I will train a Word2Vec model on the training data `X_train`. Next, I will convert each example in the training and test sets into a feature vector by averaging its word embeddings. Finally, I will fit a linear regression model to the training data and evaluate it on the test data using Mean Squared Error and R-squared.
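The step of turning word embeddings into one feature vector per example can be sketched as below. This is an illustration under assumptions, not the notebook's exact code: `average_embedding` and `toy_embeddings` are hypothetical names, the vectors are made up and 3-dimensional, and a plain dictionary stands in for the trained Word2Vec model. Unlike the notebook cell, this version loops over the tokens of the example, so a repeated word counts once per occurrence.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim):
    """Return the mean vector of the tokens found in `embeddings`, or zeros if none are found."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector of the right size
        return np.zeros(dim, dtype=float)
    return np.mean(vectors, axis=0)

# Toy embedding table (hypothetical values standing in for Word2Vec vectors)
toy_embeddings = {
    "new": np.array([1.0, 0.0, 2.0]),
    "yorker": np.array([3.0, 2.0, 0.0]),
}

# "since" is not in the table, so only "new" and "yorker" are averaged
doc_vector = average_embedding(["new", "yorker", "since"], toy_embeddings, dim=3)
empty_vector = average_embedding([], toy_embeddings, dim=3)
print(doc_vector)
print(empty_vector)
```

Averaging gives every example a vector of the same size regardless of how long the description is, which is what a linear model requires. The trade-off is that word order and description length are lost.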

In [8]:
# Split datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1234)

In [10]:
# Train word2vec model
print("Begin")
word2vec_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,
                                   window=5,
                                   min_count=2)

print("End")
print(len(word2vec_model.wv.key_to_index))

Begin
End
8303


In [11]:
# Create feature vectors out of word embeddings
# Note: each comprehension loops over the Word2Vec vocabulary and keeps the words that occur
# in the example, so every in-vocabulary word contributes one vector (repeats are ignored)
# and the vectors follow the set's iteration order, not the order of the words in the text.
words = set(word2vec_model.wv.index_to_key)

print('Begin transforming X_train')
X_train_word_embeddings = np.array([np.array([word2vec_model.wv[word] for word in words if word in training_example])
                        for training_example in X_train], dtype=object)
print('Finish transforming X_train')

print('Begin transforming X_test')
X_test_word_embeddings = np.array([np.array([word2vec_model.wv[word] for word in words if word in test_example])
                        for test_example in X_test], dtype=object)
print('Finish transforming X_test')

Begin transforming X_train
Finish transforming X_train
Begin transforming X_test
Finish transforming X_test


In [15]:
# Print vectors 
print('Number of words in first training example: {0}'.format(len(X_train.iloc[0])))
print('First word in first training example: {0}'.format(X_train.iloc[0][0]))
print('Second word in first training example: {0}\n'.format(X_train.iloc[0][1]))

print('Number of word vectors in first training example: {0}'.format(len(X_train_word_embeddings[0])))
print('First word vector in first training example:\n {0}'.format(X_train_word_embeddings[0][0]))
print('\nSecond word vector in first training example: \n {0}\n'.format(X_train_word_embeddings[0][1]))

Number of words in first training example: 19
First word in first training example: new
Second word in first training example: york

Number of word vectors in first training example: 18
First word vector in first training example:
 [ 0.11737563  0.896021   -0.49120602 -0.7421468   1.1164032  -1.2969218
  0.7050462   1.1575849   2.5597365   0.59092355  0.42123252  0.4802024
  1.2551752  -0.68535614  1.6734041  -0.5253363   0.44499224 -1.8884573
 -0.7456964  -0.97429156  2.025835    0.20479389 -1.3192191  -0.28374088
 -1.0156378  -0.30201626 -0.16815338 -1.4669882  -0.198904   -1.0560857
  0.97489697 -2.9086552   0.4927408  -0.840647   -0.08462626  0.8295789
  0.03550735  0.5764585   0.00462966  0.5668449  -0.47266197 -0.6159594
 -0.9393915   0.56821215  1.3723109   0.7657183  -1.2010082  -0.7578992
  1.7391115  -0.12209626  0.9553528   1.0611155   0.86389846  0.15592071
 -0.5648193   0.6044434   2.02077    -0.14422794 -0.20774497  1.4700819
  0.30919552 -1.0013762   2.414044   -0.498990

In [18]:
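# Average the word vectors of each example into a single feature vector.
# Examples with no in-vocabulary words get a zero vector (100 matches vector_size above).
X_train_feature_vector = []
for w in X_train_word_embeddings:
    if w.size:
        X_train_feature_vector.append(w.mean(axis=0))
    else:
        X_train_feature_vector.append(np.zeros(100, dtype=float))

# Apply the same transformation to the test examples
X_test_feature_vector = []
for w in X_test_word_embeddings:
    if w.size:
        X_test_feature_vector.append(w.mean(axis=0))
    else:
        X_test_feature_vector.append(np.zeros(100, dtype=float))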

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Create a LinearRegression model object, and fit a linear regression model to the transformed training data
model = LinearRegression()
model.fit(X_train_feature_vector, y_train)

# 2. Make predictions on the transformed test data
predictions = model.predict(X_test_feature_vector)

# 3. Compute the Mean Squared Error and R-squared for the test data
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print('Mean Squared Error on the test data: {:.4f}'.format(mse))
print('R-squared on the test data: {:.4f}'.format(r2))

Mean Squared Error on the test data: 0.0744
R-squared on the test data: 0.0167


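These results indicate that the model performs poorly. R-squared is not an error metric: it measures the share of variance in the label that the model explains, so values near 1 are good and values near 0 are not. An R-squared of 0.0167 means the model explains less than 2% of the variance in host acceptance rate on the test data, which is barely better than always predicting the mean. The MSE of 0.0744 corresponds to a root mean squared error of about 0.27 on a label expressed as a fraction between 0 and 1. This suggests that the averaged word embeddings of the host description carry little predictive signal for this label, and that the model is underfitting rather than achieving a good bias-variance trade-off.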
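To put scores like these in context, it helps to compare them with a baseline that always predicts the mean of the label. The sketch below computes MSE and R-squared by hand with NumPy on a few made-up values (`y_true` is hypothetical, not taken from the Airbnb data), using $R^2 = 1 - SS_{res} / SS_{tot}$.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up acceptance rates for illustration
y_true = np.array([0.2, 0.5, 0.9, 1.0, 0.4])

# Baseline: predict the mean of these values for every example
baseline_pred = np.full_like(y_true, y_true.mean())

baseline_mse = mse(y_true, baseline_pred)
baseline_r2 = r_squared(y_true, baseline_pred)
print(baseline_mse)  # equals the variance of y_true
print(baseline_r2)   # the mean predictor scores 0 by construction
```

A model whose R-squared is close to 0 is therefore doing about as well as this constant prediction, even if its MSE looks small in absolute terms. In the notebook, the fair comparison would use the mean of `y_train` as the prediction for the test set.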

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [None]:
# SEE ABOVE

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

# YOUR CODE HERE
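
One possible way to approach the model selection step is a cross-validated grid search over a regularized linear model. The sketch below is an assumption about how that could look, not part of the original plan: it swaps in scikit-learn's `Ridge` for the plain linear regression, tunes its `alpha` with `GridSearchCV`, and runs on synthetic data (`X_toy`, `y_toy`) rather than the Airbnb feature vectors.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the embedding features (hypothetical)
rng = np.random.default_rng(1234)
X_toy = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y_toy = X_toy @ true_w + rng.normal(scale=0.5, size=200)

# Candidate regularization strengths to compare
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation; scikit-learn maximizes scores, so MSE is negated
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]
best_cv_mse = -search.best_score_
print(best_alpha, best_cv_mse)
```

In the notebook, `X_train_feature_vector` and `y_train` would take the place of the synthetic arrays, and the test set would be used only once, after the best setting has been chosen.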