<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project: Web APIs & NLP

## Problem Statement

We are a group of home improvement consultants who advise homeowners in selected neighborhoods in Ames, Iowa, on how to refurbish their houses, including which features to renovate in order to raise the value of their homes cost-effectively.

Based on the provided data, we will:
- build several multiple linear regression models and select the best-performing one as our production model
- based on our production model, explore and recommend important features for home improvement
- build models for selected neighborhoods, then explore and recommend important features for home improvement

## Background

House values are influenced by the following factors ([*source*](https://www.opendoor.com/w/blog/factors-that-influence-home-value)):
- Neighborhood comps
- Location
- Home size and usable space
- Age and condition
- Upgrades and updates
- The local market
- Economic indicators
- Interest rates

As home improvement consultants, we are more interested in the factors or features that can be improved in existing houses.


## Dataset and Data Directory
- The dataset ([*source*](https://www.kaggle.com/competitions/dsi-us-11-project-2-regression-challenge/data)) contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
- The dataset has 82 columns, which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers). ([*source*](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt))
- Some important features are listed below:


|Feature|Type|Description|
|---|---|---|
|**SalePrice**|*Continuous*|Sale price; we treat it as the house value|
|**Neighborhood**|*Nominal*|Physical locations within Ames city limits|
|**Overall Qual**|*Ordinal*|Rates the overall material and finish of the house|
|**Year Built**|*Discrete*|Original construction date|
|**Mas Vnr Type**|*Nominal*|Masonry veneer type|
|**Mas Vnr Area**|*Continuous*|Masonry veneer area in square feet|
|**Foundation**|*Nominal*|Type of foundation|
|**BsmtFin Type 1**|*Ordinal*|Rating of basement finished area|
|**BsmtFin SF 1**|*Continuous*|Type 1 finished square feet|
|**Total Bsmt SF**|*Continuous*|Total square feet of basement area|
|**Gr Liv Area**|*Continuous*|Above grade (ground) living area square feet|
|**Fireplaces**|*Discrete*|Number of fireplaces|
|**Garage Area**|*Continuous*|Size of garage in square feet|
|**Open Porch SF**|*Continuous*|Open porch area in square feet|
|**HeatingQC**|*Ordinal*|Heating quality and condition|
|**Bedroom**|*Discrete*|Bedrooms above grade (does NOT include basement bedrooms)|
|**Kitchen**|*Discrete*|Kitchens above grade|
|**KitchenQual**|*Ordinal*|Kitchen quality|
|**TotRmsAbvGrd**|*Discrete*|Total rooms above grade (does not include bathrooms)|


# Import libraries

In [1]:
# Imports:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', None)

import pickle
import requests
import time



# Define some functions

In [2]:
# This function will:
# 1. fetch up to 100 posts from a subreddit that were created before the input time
# 2. return:
#           - the dataframe holding the fetched posts
#           - the created time (UTC epoch seconds) of the last post
def fetch_100_posts(subreddit,
                    utc  # the input time (UTC epoch seconds)
                   ):

    # create the url and parameters
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100,    # how many posts the function will fetch
        'before': utc
    }

    # fetch the posts
    res = requests.get(url, params=params)

    # if successfully fetched, create and return a dataframe which holds the posts
    if res.status_code == 200:
        data = res.json()
        posts = data['data']
        df = pd.DataFrame(posts)
        # an empty page has no 'created_utc' column, so treat it as a failed fetch
        if df.empty:
            return None, None
        # return the cursor as a plain int rather than a one-element Series
        return df, int(df['created_utc'].iloc[-1])
    else:
        print('request failed with status code', res.status_code)
        return None, None
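
The second return value of `fetch_100_posts` is used as the `before` cursor for the next request, so it is meant to be a single timestamp. As a minimal sketch of why the form of that value matters, the snippet below compares `.tail(1)` with `.iloc[-1]` on a small, hand-made dataframe (the `page` frame is an illustrative stand-in, not real API output).

```python
import pandas as pd

# Hypothetical miniature of one fetched page (illustrative values only)
page = pd.DataFrame({"created_utc": [1658491375, 1658491121, 1658491016]})

# .tail(1) keeps the pandas wrapper: a one-element Series with its index label
as_series = page["created_utc"].tail(1)

# .iloc[-1] extracts the value itself; int() turns it into a plain Python int
as_scalar = int(page["created_utc"].iloc[-1])

print(type(as_series).__name__, type(as_scalar).__name__)
```

A plain integer is the safer choice for a query parameter, since it is serialized the same way as the `int(time.time())` value used for the first request.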
        

In [3]:
# Based on the input parameters, this function will fetch n posts from a subreddit
# and return the dataframe holding all fetched posts
def fetch_n_posts(subreddit, n=1000):

    # get current time
    current_time = int(time.time())

    # fetch the 100 most recently created posts
    df, last_post_utc = fetch_100_posts(subreddit, current_time)
    if df is None:
        return None
    print(f'have fetched 100 posts on {subreddit}')

    # fetch the remaining posts
    for i in range(1, n // 100):

        print('start to sleep for 5 seconds')
        time.sleep(5)

        df1, last_post_utc = fetch_100_posts(subreddit, last_post_utc)
        if df1 is None:
            # stop early and keep what has been fetched so far
            break
        print(f'have fetched {(i+1)*100} posts on {subreddit} in total')
        df = pd.concat(objs=[df, df1], axis=0)
        df = df.reset_index(drop=True)

    print('have fetched all posts')

    #df['title + selftext'] = df['title'] + ' ' + df['selftext']

    #return df.loc[:,['subreddit','title + selftext']]
    return df
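
The pagination logic in `fetch_n_posts` can be tried offline. The sketch below is not the notebook's code: `fake_fetch` and `paginate` are hypothetical names, and `fake_fetch` simply fabricates timestamps older than the cursor so that the loop can be exercised without calling the API or sleeping.

```python
import pandas as pd

def fake_fetch(subreddit, before, size=100):
    """Hypothetical stand-in for the API call: `size` posts older than `before`."""
    stamps = [before - k for k in range(1, size + 1)]
    df = pd.DataFrame({"subreddit": subreddit, "created_utc": stamps})
    # the last row is the oldest post and becomes the next cursor
    return df, int(df["created_utc"].iloc[-1])

def paginate(fetch, subreddit, start_utc, n=300):
    """Collect n posts in pages of 100, moving the `before` cursor backwards."""
    frames, cursor = [], start_utc
    for _ in range(n // 100):
        page, cursor = fetch(subreddit, cursor)
        if page is None:
            break  # keep whatever was collected before the failure
        frames.append(page)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

demo = paginate(fake_fetch, "AskWomen", start_utc=1_000_000, n=300)
print(len(demo), demo["created_utc"].is_unique)
```

Collecting the pages in a list and concatenating once avoids re-copying the growing dataframe on every iteration, which is a small design difference from the loop above.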
    

#### Fetch 2000 posts from subreddit 'AskWomen'

In [4]:
subreddit = "AskWomen"
n = 2000
df_askwomen = fetch_n_posts(subreddit, n)
df_askwomen.head()
    

have fetched 100 posts on AskWomen
start to sleep for 5 seconds
have fetched 200 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 300 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 400 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 500 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 600 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 700 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 800 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 900 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 1000 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 1100 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 1200 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 1300 posts on AskWomen in total
start to sleep for 5 seconds
have fetched 1400 posts on AskWomen in total
start to sl

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,author_flair_template_id,author_flair_text_color,author_flair_background_color,banned_by,author_cakeday,post_hint,preview,call_to_action,category
0,[],False,naughtygeekyredditor,,[],,text,t2_6ch7kb03,False,False,False,[],False,False,1658491375,self.AskWomen,https://www.reddit.com/r/AskWomen/comments/w58...,{},w58vxv,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,True,all_ads,/r/AskWomen/comments/w58vxv/how_often_do_you_m...,False,6,automod_filtered,1658491386,1,[removed],True,False,False,AskWomen,t5_2rxrw,3822025,public,top,nsfw,How often do you masturbate?,0,[],1.0,https://www.reddit.com/r/AskWomen/comments/w58...,promo_adult_nsfw,3,,,,,,,,,
1,[],False,kia-audi-spider-legs,,[],,text,t2_kjkrn8oa,False,False,False,[],False,False,1658491121,self.AskWomen,https://www.reddit.com/r/AskWomen/comments/w58...,{},w58shp,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/AskWomen/comments/w58shp/how_would_you_hear...,False,6,moderator,1658491132,1,[removed],True,False,False,AskWomen,t5_2rxrw,3822016,public,top,self,How would you hear “No one could ever be as at...,0,[],1.0,https://www.reddit.com/r/AskWomen/comments/w58...,all_ads,6,,,,,,,,,
2,[],False,kia-audi-spider-legs,,[],,text,t2_kjkrn8oa,False,False,False,[],False,False,1658491016,self.AskWomen,https://www.reddit.com/r/AskWomen/comments/w58...,{},w58rbp,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/AskWomen/comments/w58rbp/no_one_could_ever_...,False,6,moderator,1658491027,1,[removed],True,False,False,AskWomen,t5_2rxrw,3822012,public,top,self,“No one could ever be as attracted to you as I...,0,[],1.0,https://www.reddit.com/r/AskWomen/comments/w58...,all_ads,6,,,,,,,,,
3,[],False,tsbxred,female,[],♀,text,t2_om297,False,False,False,[],False,False,1658490938,self.AskWomen,https://www.reddit.com/r/AskWomen/comments/w58...,{},w58qi4,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/AskWomen/comments/w58qi4/what_is_your_exper...,False,6,automod_filtered,1658490949,1,[removed],True,False,False,AskWomen,t5_2rxrw,3822007,public,top,self,What is your experience of staying with a miso...,0,[],1.0,https://www.reddit.com/r/AskWomen/comments/w58...,all_ads,6,8106c61a-c8aa-11e1-a771-12313b0ce1e2,dark,,,,,,,
4,[],False,Spiritual-Ad8437,,[],,text,t2_7xt6wpkw,False,False,False,[],False,False,1658490589,self.AskWomen,https://www.reddit.com/r/AskWomen/comments/w58...,{},w58mnf,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/AskWomen/comments/w58mnf/why_is_it_more_soc...,False,6,automod_filtered,1658490600,1,[removed],True,False,False,AskWomen,t5_2rxrw,3821993,public,top,self,Why is it more socially acceptable when a woma...,0,[],1.0,https://www.reddit.com/r/AskWomen/comments/w58...,all_ads,6,,,,,,,,,


#### Fetch 2000 posts from subreddit 'Askmen'

In [5]:
subreddit = "Askmen"
n = 2000
df_askmen = fetch_n_posts(subreddit, n)
df_askmen.head()
    

have fetched 100 posts on Askmen
start to sleep for 5 seconds
have fetched 200 posts on Askmen in total
start to sleep for 5 seconds
have fetched 300 posts on Askmen in total
start to sleep for 5 seconds
have fetched 400 posts on Askmen in total
start to sleep for 5 seconds
have fetched 500 posts on Askmen in total
start to sleep for 5 seconds
have fetched 600 posts on Askmen in total
start to sleep for 5 seconds
have fetched 700 posts on Askmen in total
start to sleep for 5 seconds
have fetched 800 posts on Askmen in total
start to sleep for 5 seconds
have fetched 900 posts on Askmen in total
start to sleep for 5 seconds
have fetched 1000 posts on Askmen in total
start to sleep for 5 seconds
have fetched 1100 posts on Askmen in total
start to sleep for 5 seconds
have fetched 1200 posts on Askmen in total
start to sleep for 5 seconds
have fetched 1300 posts on Askmen in total
start to sleep for 5 seconds
have fetched 1400 posts on Askmen in total
start to sleep for 5 seconds
have fetch

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,author_flair_background_color,author_flair_template_id,author_flair_text_color,banned_by,post_hint,preview,author_cakeday
0,[],False,capuccinohedgie,,[],,text,t2_8q465stc,False,False,False,[],False,False,1658491492,self.AskMen,https://www.reddit.com/r/AskMen/comments/w58xe...,{},w58xer,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/AskMen/comments/w58xer/is_wearing_a_wig_a_t...,False,6,moderator,1658491502,1,[removed],True,False,False,AskMen,t5_2s30g,4000940,public,self,Is wearing a wig a turn off,0,[],1.0,https://www.reddit.com/r/AskMen/comments/w58xe...,all_ads,6,,,,,,,
1,[],False,wondroussarah,,[],,text,t2_976wlsse,False,False,False,[],False,False,1658491238,self.AskMen,https://www.reddit.com/r/AskMen/comments/w58u6...,{},w58u6v,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/AskMen/comments/w58u6v/do_men_come_up_to_mo...,False,6,moderator,1658491249,1,[removed],True,False,False,AskMen,t5_2s30g,4000934,public,self,Do men come up to MOST women to ask for relati...,0,[],1.0,https://www.reddit.com/r/AskMen/comments/w58u6...,all_ads,6,,,,,,,
2,[],False,Commercial_Fuel_3519,,[],,text,t2_3l9elkem,False,False,False,[],False,False,1658490909,self.AskMen,https://www.reddit.com/r/AskMen/comments/w58q6...,{},w58q67,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/AskMen/comments/w58q67/you_have_a_16_year_o...,False,6,,1658490920,1,,True,False,False,AskMen,t5_2s30g,4000920,public,self,You have a 16 year old daughter who generally ...,0,[],1.0,https://www.reddit.com/r/AskMen/comments/w58q6...,all_ads,6,,,,,,,
3,[],False,ZoneWestern464,,[],,text,t2_eu514prg,False,False,False,[],False,False,1658490844,self.AskMen,https://www.reddit.com/r/AskMen/comments/w58pg...,{},w58pg8,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/AskMen/comments/w58pg8/mods_are_vir/,False,6,moderator,1658490855,1,[removed],True,False,False,AskMen,t5_2s30g,4000920,public,self,mods are vir....,0,[],1.0,https://www.reddit.com/r/AskMen/comments/w58pg...,all_ads,6,,,,,,,
4,[],False,Recent-Manager-9875,,[],,text,t2_pemzjrr2,False,False,False,[],False,False,1658490840,self.AskMen,https://www.reddit.com/r/AskMen/comments/w58pe...,{},w58peh,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/AskMen/comments/w58peh/do_you_have_a_specif...,False,6,moderator,1658490850,1,[removed],True,False,False,AskMen,t5_2s30g,4000920,public,self,Do you have a specific hair routine and what k...,0,[],1.0,https://www.reddit.com/r/AskMen/comments/w58pe...,all_ads,6,,,,,,,
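
Many of the rows shown above have `[removed]` as their `selftext`. As a rough sketch of the title-plus-body combination that is commented out in `fetch_n_posts`, the snippet below treats `[removed]` and missing bodies as empty text. The helper name `combine_text` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

def combine_text(df):
    """Hypothetical helper: join title and selftext, ignoring removed or missing bodies."""
    out = df.copy()
    # exact-match replacement: only cells equal to '[removed]' become empty
    body = out["selftext"].fillna("").replace({"[removed]": ""})
    out["title + selftext"] = (out["title"].fillna("") + " " + body).str.strip()
    return out.loc[:, ["subreddit", "title + selftext"]]

# Made-up sample rows in the same shape as the fetched data
sample = pd.DataFrame({
    "subreddit": ["AskMen", "AskMen", "AskWomen"],
    "title": ["Is wearing a wig a turn off", "A question", "Another question"],
    "selftext": ["[removed]", "with some body text", None],
})
combined = combine_text(sample)
print(combined)
```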


# Pickle

In [6]:
pickle_dict = dict()
pickle_dict['df_askwomen'] = df_askwomen 
pickle_dict['df_askmen'] = df_askmen 

# In order to keep the data constant, we pickle the data into 'data2.pkl' for demonstration in this notebook.
# The actual data for building models has been pickled into 'data.pkl'.
# The with-statement closes the file once the dump has finished.
with open('../datasets/data2.pkl', 'wb') as f:
    pickle.dump(pickle_dict, f)
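
For reference, reading the data back is the mirror image of the dump. The sketch below round-trips a small dictionary of dataframes through a temporary file; the temporary path and the tiny frames are placeholders, so it does not touch the notebook's `../datasets` directory.

```python
import os
import pickle
import tempfile

import pandas as pd

# Placeholder data with the same dictionary layout as pickle_dict
demo_dict = {
    "df_askwomen": pd.DataFrame({"title": ["a"]}),
    "df_askmen": pd.DataFrame({"title": ["b"]}),
}

# Temporary file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")

with open(path, "wb") as f:
    pickle.dump(demo_dict, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded.keys()))
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code embedded in the file.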