<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project: Web APIs & NLP

## Problem Statement

We are a group of home improvement consultants who advise homeowners in selected neighborhoods of Ames, Iowa on how to refurbish their houses. Our focus is on identifying the best features to renovate so that homeowners can raise the value of their homes in a cost-effective way.

Based on the provided data, we will:
- build several multiple linear regression models and select the best-performing one as our production model
- use the production model to explore and recommend important features for home improvement
- build models for selected neighborhoods, then explore and recommend important features for home improvement in each
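
The model-selection step in the list above could look roughly like the sketch below. It is only an illustration: it uses synthetic data from `make_regression` instead of the Ames dataset, and the `alpha` values and the 5-fold cross-validated RMSE criterion are placeholder assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and sale prices (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Candidate models; the alpha values are illustrative, not tuned.
candidates = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=1.0),
}

cv_rmse = {}
for name, model in candidates.items():
    # Scale inside the pipeline so each CV fold is scaled on its own training part.
    pipe = Pipeline([('scale', StandardScaler()), ('model', model)])
    scores = cross_val_score(pipe, X, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()

# The candidate with the lowest cross-validated RMSE becomes the production model.
best_name = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, best_name)
```

Putting the scaler inside the pipeline keeps the comparison fair, because no information from a validation fold leaks into the scaling step.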

## Background

House values are influenced by the following factors ([*source*](https://www.opendoor.com/w/blog/factors-that-influence-home-value)):
- Neighborhood comps
- Location
- Home size and usable space
- Age and condition
- Upgrades and updates
- The local market
- Economic indicators
- Interest rates

As home improvement consultants, we are most interested in the factors or features that can be improved on existing houses.


## Dataset and Data Directory
- The dataset ([*source*](https://www.kaggle.com/competitions/dsi-us-11-project-2-regression-challenge/data)) contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
- The dataset has 82 columns, which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers). ([*source*](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt))
- Some important features are listed below:


|Feature|Type|Description|
|---|---|---|
|**SalePrice**|*Continuous*|Sale price; we treat it as the house value|
|**Neighborhood**|*Nominal*|Physical locations within Ames city limits|
|**Overall Qual**|*Ordinal*|Rates the overall material and finish of the house|
|**Year Built**|*Discrete*|Original construction date|
|**Mas Vnr Type**|*Nominal*|Masonry veneer type|
|**Mas Vnr Area**|*Continuous*|Masonry veneer area in square feet|
|**Foundation**|*Nominal*|Type of foundation|
|**BsmtFin Type 1**|*Ordinal*|Rating of basement finished area|
|**BsmtFin SF 1**|*Continuous*|Type 1 finished square feet|
|**Total Bsmt SF**|*Continuous*|Total square feet of basement area|
|**Gr Liv Area**|*Continuous*|Above grade (ground) living area square feet|
|**Fireplaces**|*Discrete*|Number of fireplaces|
|**Garage Area**|*Continuous*|Size of garage in square feet|
|**Open Porch SF**|*Continuous*| Open porch area in square feet|
|**HeatingQC**|*Ordinal*|Heating quality and condition|
|**Bedroom**|*Discrete*|Bedrooms above grade (does NOT include basement bedrooms)|
|**Kitchen**|*Discrete*|Kitchens above grade|
|**KitchenQual**|*Ordinal*|Kitchen quality|
|**TotRmsAbvGrd**|*Discrete*|Total rooms above grade (does not include bathrooms)|


# Import libraries

In [152]:
# Imports:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', None)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.inspection import partial_dependence, PartialDependenceDisplay

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from statsmodels.graphics.gofplots import qqplot
import statsmodels.api as sm


import pickle
import requests
import time



# Define helper functions

In [153]:
def fetch_100_posts(subreddit, utc):
    """Fetch up to 100 submissions from `subreddit` created before `utc`.

    Returns (DataFrame, created_utc of the last post in the batch),
    or (None, None) if the request fails or returns no posts.
    """
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100,
        'before': utc
    }

    print(f'fetching posts on {subreddit}...')
    res = requests.get(url, params=params)
    if res.status_code != 200:
        print('request failed with status code', res.status_code)
        return None, None

    posts = res.json()['data']
    if not posts:
        print('no more posts returned')
        return None, None

    df = pd.DataFrame(posts)
    # return the timestamp as a plain int so it can be used as the next `before` value
    return df, int(df['created_utc'].iloc[-1])
        

In [154]:
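def fetch_n_posts(subreddit, n=1000):
    """Fetch roughly `n` submissions from `subreddit`, 100 per request."""
    current_time = int(time.time())

    # fetch the first 100 posts
    df, last_post_utc = fetch_100_posts(subreddit, current_time)
    if df is None:
        raise RuntimeError(f'could not fetch any posts on {subreddit}')
    print(f'have fetched 100 posts on {subreddit}')

    # fetch the remaining posts, using the last timestamp seen as the cursor
    for i in range(1, n // 100):
        df1, next_utc = fetch_100_posts(subreddit, last_post_utc)
        if df1 is None:
            break  # stop early if a request fails or no posts are left
        last_post_utc = next_utc
        df = pd.concat(objs=[df, df1], axis=0).reset_index(drop=True)
        print(f'have fetched {(i+1)*100} posts on {subreddit} in total')

        # pause between requests to avoid overloading the API
        print('start to sleep for 5 seconds')
        time.sleep(5)
    print('have fetched all posts')

    # selftext can be missing for some posts; treat it as empty text
    df['title + selftext'] = df['title'] + ' ' + df['selftext'].fillna('')

    return df.loc[:, ['subreddit', 'title + selftext']]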
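
The cursor-based pagination used by `fetch_n_posts` can be illustrated offline. The sketch below is not the notebook's code: `paginate`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, the "API" is an in-memory list, and it assumes that `before` is exclusive and that each page comes back newest first.

```python
def paginate(fetch_page, start_utc, n_pages):
    """Collect posts page by page, moving the `before` cursor backwards in time."""
    posts = []
    before = start_utc
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:  # stop early when the source has no older posts
            break
        posts.extend(page)
        # The oldest post on this page becomes the cursor for the next request.
        before = min(p['created_utc'] for p in page)
    return posts


# Hypothetical in-memory "API": 25 fake posts with timestamps 1..25.
FAKE_POSTS = [{'created_utc': t, 'title': f'post {t}'} for t in range(1, 26)]


def fake_fetch(before, size=10):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return sorted(older, key=lambda p: p['created_utc'], reverse=True)[:size]


collected = paginate(fake_fetch, start_utc=26, n_pages=5)
print(len(collected))
```

Because the cursor is always the oldest timestamp seen so far, consecutive pages do not overlap, and the loop ends cleanly once an empty page is returned.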
        
    

In [155]:
subreddit = "AskWomen"
n = 1000
df_askwomen = fetch_n_posts(subreddit, n)
df_askwomen.head()
    

fetching posts on {subreddit}...
   subreddit  created_utc   selftext  \
95  AskWomen   1658220063              
96  AskWomen   1658220031  [removed]   
97  AskWomen   1658220022              
98  AskWomen   1658219744  [removed]   
99  AskWomen   1658219304  [removed]   

                                                title  
95  women of Reddit, what are your lucrative side ...  
96    How to approach my partner about their porn use  
97  Woman of reddit married to doctors, what are t...  
98  What would you do when you realize the guy you...  
99  What should I do if I really like a family fri...  
done!
have fetched 100 posts on AskWomen
fetching posts on AskWomen...
have fetched 200 posts on AskWomen in total
start to sleep for 5 seconds
fetching posts on AskWomen...
have fetched 300 posts on AskWomen in total
start to sleep for 5 seconds
fetching posts on AskWomen...
have fetched 400 posts on AskWomen in total
start to sleep for 5 seconds
fetching posts on AskWomen...
have fetch

| |subreddit|title + selftext|
|---|---|---|
|0|AskWomen|What podcasts serve as confidence booster for ...|
|1|AskWomen|How did you get rid of an old friend who alway...|
|2|AskWomen|Why do they? [removed]|
|3|AskWomen|Why do? [removed]|
|4|AskWomen|Why do men? [removed]|


In [None]:
subreddit = "Askmen"
n = 1000
df_askmen = fetch_n_posts(subreddit, n)
df_askmen.head()
    

fetching posts on {subreddit}...
   subreddit  created_utc   selftext  \
95    AskMen   1658226153              
96    AskMen   1658226125  [removed]   
97    AskMen   1658225799              
98    AskMen   1658225789  [removed]   
99    AskMen   1658225431  [removed]   

                                                title  
95  What’s the weirdest thing to happen in a locke...  
96  Is it really ‘the quiet girls’ you need to wat...  
97  What's your all-time favorite movie that you'v...  
98                                              Hmmmm  
99                    just wanna understand something  
done!
have fetched 100 posts on Askmen
fetching posts on Askmen...
have fetched 200 posts on Askmen in total
start to sleep for 5 seconds
fetching posts on Askmen...
have fetched 300 posts on Askmen in total
start to sleep for 5 seconds
fetching posts on Askmen...
have fetched 400 posts on Askmen in total
start to sleep for 5 seconds
fetching posts on Askmen...
have fetched 500 posts on 

In [None]:
df = pd.concat(objs=[df_askwomen,df_askmen], axis=0)
df_final = df.reset_index(drop=True)

In [None]:
df_final.shape

In [None]:
df['subreddit'].value_counts()

# Save the data with pickle

In [None]:
# use a context manager so the file is closed once the data is written
with open('../datasets/data.pkl', 'wb') as f:
    pickle.dump(df_final, f)
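
To confirm that a pickled object can be restored unchanged, a round trip like the one below may help. This is a minimal sketch under stated assumptions: `records` is a small stand-in list rather than the real `df_final`, and the file is written to a temporary directory instead of `../datasets`.

```python
import os
import pickle
import tempfile

# Small stand-in for df_final (assumption: any picklable object works the same way).
records = [
    {'subreddit': 'AskWomen', 'title + selftext': 'example text A'},
    {'subreddit': 'AskMen', 'title + selftext': 'example text B'},
]

# Write to a temporary directory so the sketch does not touch ../datasets.
path = os.path.join(tempfile.mkdtemp(), 'data.pkl')
with open(path, 'wb') as f:
    pickle.dump(records, f)

# Read the file back and compare with the original object.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == records)
```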

## Conclusions and Recommendations

- We built several models and found that the lasso model performed best
- Based on the lasso model, we identified 'BsmtFin SF 1' and 'Fireplaces' as recommended features for home improvement
- We built models for selected neighborhoods and recommended some features for home improvement
- The neighborhood models also suggest that home improvement is more worthwhile for newer or higher-value houses, because it creates more value there

## Limitation and Future Enhancement
- Our 3 multiple regression models perform very similarly, which suggests they may be underfitting; adding more features may improve model performance
- Our models do not satisfy the LINE assumptions (linearity, independence, normality of residuals, equal variance), so models other than linear regression should be considered
- Many variables are highly imbalanced and barely useful. The data collector might reconsider which variables to collect in the future
- Many neighborhoods have too little data to build meaningful models. More data is needed
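
One simple way to probe the equal-variance part of the LINE assumptions mentioned above is to compare the residual spread across the range of fitted values. The sketch below is only an illustration on synthetic data whose noise deliberately grows with `x`; it is not the diagnostic run on this project's models, and the median split is an assumed rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data whose noise grows with x (deliberately violates equal variance).
x = rng.uniform(1, 10, size=500)
y = 3 + 2 * x + rng.normal(0, 0.5 * x)

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# Compare residual spread below and above the median fitted value.
median_fit = np.median(fitted)
low = residuals[fitted <= median_fit]
high = residuals[fitted > median_fit]
spread_ratio = high.std() / low.std()
print(f'residual std ratio (high / low fitted values): {spread_ratio:.2f}')
```

A ratio well above 1 hints that the residual variance grows with the prediction, which is one sign that the equal-variance assumption does not hold.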