## Project Statement





## Problem Statement
Zillow's current house-pricing metrics are outdated. Predictions rely primarily on bedroom count, bathroom count, and square footage. With the shift toward smaller, micro-style living, homes should be evaluated more on overall quality than on overall quantity.


## Executive Summary
### Contents
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Numerical Features](#Numerical-Features)
- [Categorical Features](#Categorical-Features)
- [Parameters](#Parameters)
- [Score](#Score)
- [Prediction](#Prediction)
- [Save to file](#Save-to-file)

In [2]:
import requests
import time
import pandas as pd
import urllib3
import json
import lxml.html as lh
import datetime as dt
import nltk
import regex as re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

In [2]:
lemmatizer = WordNetLemmatizer()

In [3]:
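def query_pushshift(subreddit, kind='submission', skip=30, times=15,
                    subfield=('title', 'selftext', 'subreddit', 'created_utc',
                              'author', 'num_comments', 'score', 'is_self'),
                    comfields=('body', 'score', 'created_utc')):
    """Pull posts from the Pushshift API in windows of `skip` days.

    Makes `times - 1` requests (after=skip*1 d, ..., after=skip*(times-1) d)
    and returns the concatenated results as one DataFrame.
    `comfields` is accepted for comments but is not used yet.
    """
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    frames = []

    for x in range(1, times):
        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        response = requests.get(URL)
        # Stop immediately on any non-200 response
        assert response.status_code == 200
        frames.append(pd.DataFrame.from_dict(response.json()['data']))
        # Pause between requests to avoid hammering the API
        time.sleep(2)

    full = pd.concat(frames, sort=False)

    if kind == "submission":
        # Keep the selected columns, drop duplicates, keep self posts only
        full = full[list(subfield)]
        full = full.drop_duplicates()
        full = full.loc[full['is_self'] == True].copy()

    # created_utc is a Unix timestamp; date.fromtimestamp converts it
    # using the local time zone of the machine running the notebook
    full['timestamp'] = full["created_utc"].apply(dt.date.fromtimestamp)

    print(full.shape)

    return full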
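
The request windows used by `query_pushshift` can be previewed without touching the network. The sketch below is only an illustration: `build_urls` is a hypothetical helper, not part of the notebook, that reproduces the URL construction from the function above. It shows that `range(1, times)` yields `times - 1` requests, so the default `times=15` produces 14 URLs.

```python
def build_urls(subreddit, kind="submission", skip=30, times=15):
    # Hypothetical helper: mirrors the URL construction in query_pushshift
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    # range(1, times) stops at times - 1, so the last window is skip * (times - 1) days
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times)]


urls = build_urls("evolution")
print(len(urls))   # 14
print(urls[0])     # ends with after=30d
print(urls[-1])    # ends with after=420d
```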

In [4]:
evolution = query_pushshift('evolution')

https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=evolution&size=500&after=330d
ht

In [5]:
conspiracy = query_pushshift('conspiracy')

https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=conspiracy&size=500&af

In [6]:
evolution.to_csv('../datasets/evolution.csv')

In [7]:
conspiracy.to_csv('../datasets/conspiracy.csv')

In [75]:
evolution = pd.read_csv('../datasets/evolution.csv')

In [76]:
conspiracy = pd.read_csv('../datasets/conspiracy.csv')

In [77]:
evolution.head()

Unnamed: 0.1,Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,0,Why is it almost always the cultures and civil...,,evolution,1542727068,JacksonBark,41,1,True,2018-11-20
1,5,are animals even remotely conscious?,is the evolution of domesticated animals drivi...,evolution,1542775115,pringlesgalore,8,1,True,2018-11-20
2,6,baby chameleon has strange adaptation?,watching bbc madagascar on netflix and these t...,evolution,1542776009,pringlesgalore,1,1,True,2018-11-20
3,7,"The Athletties: Dekey Pete rises, the DJs are ...",[https://theathletic.com/660010/2018/11/18/the...,evolution,1542800440,thigpeneckard,2,1,True,2018-11-21
4,10,Dolphins vs Lucy the Australopithecus,I ended up in an argument with a friend today ...,evolution,1542845513,EvoSoldior,9,1,True,2018-11-21
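
The `timestamp` column comes from `dt.date.fromtimestamp`, which converts using the local time zone of the machine that ran the notebook. The second row above (`created_utc` of 1542775115) is shown as 2018-11-20, although that instant falls on 2018-11-21 in UTC. A minimal sketch of a time-zone-independent conversion follows; `utc_date` is a hypothetical helper name, not something the notebook defines.

```python
import datetime as dt


def utc_date(created_utc):
    # Interpret the Unix timestamp explicitly in UTC, then keep only the date
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc).date()


print(utc_date(1542727068))  # 2018-11-20
print(utc_date(1542775115))  # 2018-11-21
```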


In [78]:
conspiracy.head()

Unnamed: 0.1,Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,2,DEW weapons are not a conspiracy theory but a ...,"Ionispheroc heaters, smart meters, DEW, ect......",conspiracy,1542726123,ilikerealmaplesyrup,16,1,True,2018-11-20
1,3,Pluto is a planet (and also the Scientific Met...,I feel the demotion of Pluto was a test. To se...,conspiracy,1542726148,CybergothiChe,30,1,True,2018-11-20
2,6,Are humans really Frugivores? Have we been liv...,I'm not sure about posting this here but I don...,conspiracy,1542726992,Uyrr,98,1,True,2018-11-20
3,7,FlashBack - Traveling on Delta Flight 1989 on ...,[removed],conspiracy,1542727169,outoftheMultiverse,0,1,True,2018-11-20
4,9,Discussion Thread Hancock's Stoned Ape Theory ...,Just curious as to what this sub thinks about ...,conspiracy,1542727513,kthxbrit,33,1,True,2018-11-20


In [79]:
evolution.drop('Unnamed: 0', axis=1, inplace=True)

In [80]:
conspiracy.drop('Unnamed: 0', axis=1, inplace=True)
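
The `Unnamed: 0` column dropped above is the old index, which `to_csv` writes by default and `read_csv` then reads back as an ordinary column. As a sketch on a toy frame (the data here is made up, and an in-memory buffer stands in for the CSV file), passing `index=False` when saving avoids the extra column:

```python
import io

import pandas as pd

df = pd.DataFrame({"title": ["a", "b"], "score": [1, 2]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'title', 'score']
print(list(reloaded_clean.columns))       # ['title', 'score']
```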

In [81]:
evolution.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Why is it almost always the cultures and civil...,,evolution,1542727068,JacksonBark,41,1,True,2018-11-20
1,are animals even remotely conscious?,is the evolution of domesticated animals drivi...,evolution,1542775115,pringlesgalore,8,1,True,2018-11-20
2,baby chameleon has strange adaptation?,watching bbc madagascar on netflix and these t...,evolution,1542776009,pringlesgalore,1,1,True,2018-11-20
3,"The Athletties: Dekey Pete rises, the DJs are ...",[https://theathletic.com/660010/2018/11/18/the...,evolution,1542800440,thigpeneckard,2,1,True,2018-11-21
4,Dolphins vs Lucy the Australopithecus,I ended up in an argument with a friend today ...,evolution,1542845513,EvoSoldior,9,1,True,2018-11-21


In [82]:
conspiracy.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,DEW weapons are not a conspiracy theory but a ...,"Ionispheroc heaters, smart meters, DEW, ect......",conspiracy,1542726123,ilikerealmaplesyrup,16,1,True,2018-11-20
1,Pluto is a planet (and also the Scientific Met...,I feel the demotion of Pluto was a test. To se...,conspiracy,1542726148,CybergothiChe,30,1,True,2018-11-20
2,Are humans really Frugivores? Have we been liv...,I'm not sure about posting this here but I don...,conspiracy,1542726992,Uyrr,98,1,True,2018-11-20
3,FlashBack - Traveling on Delta Flight 1989 on ...,[removed],conspiracy,1542727169,outoftheMultiverse,0,1,True,2018-11-20
4,Discussion Thread Hancock's Stoned Ape Theory ...,Just curious as to what this sub thinks about ...,conspiracy,1542727513,kthxbrit,33,1,True,2018-11-20


In [83]:
evolution.dropna(inplace=True)

In [84]:
evolution.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
1,are animals even remotely conscious?,is the evolution of domesticated animals drivi...,evolution,1542775115,pringlesgalore,8,1,True,2018-11-20
2,baby chameleon has strange adaptation?,watching bbc madagascar on netflix and these t...,evolution,1542776009,pringlesgalore,1,1,True,2018-11-20
3,"The Athletties: Dekey Pete rises, the DJs are ...",[https://theathletic.com/660010/2018/11/18/the...,evolution,1542800440,thigpeneckard,2,1,True,2018-11-21
4,Dolphins vs Lucy the Australopithecus,I ended up in an argument with a friend today ...,evolution,1542845513,EvoSoldior,9,1,True,2018-11-21
5,Is there any examples of actual evolution taki...,Is there any proof of evolution taking place. ...,evolution,1542865599,StockClassic,37,1,True,2018-11-22


In [85]:
conspiracy.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,DEW weapons are not a conspiracy theory but a ...,"Ionispheroc heaters, smart meters, DEW, ect......",conspiracy,1542726123,ilikerealmaplesyrup,16,1,True,2018-11-20
1,Pluto is a planet (and also the Scientific Met...,I feel the demotion of Pluto was a test. To se...,conspiracy,1542726148,CybergothiChe,30,1,True,2018-11-20
2,Are humans really Frugivores? Have we been liv...,I'm not sure about posting this here but I don...,conspiracy,1542726992,Uyrr,98,1,True,2018-11-20
3,FlashBack - Traveling on Delta Flight 1989 on ...,[removed],conspiracy,1542727169,outoftheMultiverse,0,1,True,2018-11-20
4,Discussion Thread Hancock's Stoned Ape Theory ...,Just curious as to what this sub thinks about ...,conspiracy,1542727513,kthxbrit,33,1,True,2018-11-20


In [86]:
both = [conspiracy, evolution]

In [87]:
combined = pd.concat(both)
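
`pd.concat` keeps each frame's original index labels, so the combined frame can contain the same label more than once. In this notebook label 0 appears to stay unique only because the `evolution` row 0 was removed by `dropna`. The toy frames below are made up to illustrate the effect and how `ignore_index=True` renumbers the rows:

```python
import pandas as pd

a = pd.DataFrame({"selftext": ["first a", "second a"]})
b = pd.DataFrame({"selftext": ["first b", "second b"]})

stacked = pd.concat([a, b])                        # keeps labels 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # relabels rows 0..3

print(stacked.index.tolist())           # [0, 1, 0, 1]
print(len(stacked["selftext"].loc[0]))  # 2, because label 0 matches two rows
print(renumbered["selftext"].loc[0])    # first a
```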

In [88]:
conspiracy.shape

(2443, 9)

In [89]:
evolution.shape

(887, 9)

In [90]:
combined.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,DEW weapons are not a conspiracy theory but a ...,"Ionispheroc heaters, smart meters, DEW, ect......",conspiracy,1542726123,ilikerealmaplesyrup,16,1,True,2018-11-20
1,Pluto is a planet (and also the Scientific Met...,I feel the demotion of Pluto was a test. To se...,conspiracy,1542726148,CybergothiChe,30,1,True,2018-11-20
2,Are humans really Frugivores? Have we been liv...,I'm not sure about posting this here but I don...,conspiracy,1542726992,Uyrr,98,1,True,2018-11-20
3,FlashBack - Traveling on Delta Flight 1989 on ...,[removed],conspiracy,1542727169,outoftheMultiverse,0,1,True,2018-11-20
4,Discussion Thread Hancock's Stoned Ape Theory ...,Just curious as to what this sub thinks about ...,conspiracy,1542727513,kthxbrit,33,1,True,2018-11-20


In [91]:
combined.shape

(3330, 9)

In [92]:
combined.isnull().sum()

title             0
selftext        173
subreddit         0
created_utc       0
author            0
num_comments      0
score             0
is_self           0
timestamp         0
dtype: int64

In [93]:
combined.dropna(inplace=True)

In [94]:
combined.isnull().sum()

title           0
selftext        0
subreddit       0
created_utc     0
author          0
num_comments    0
score           0
is_self         0
timestamp       0
dtype: int64

In [97]:
# Name the parser explicitly and select the first row by position
ini_bs = BeautifulSoup(combined['selftext'].iloc[0], 'html.parser')

In [98]:
print(combined['selftext'][0])
print(ini_bs.get_text())

Ionispheroc heaters, smart meters, DEW, ect... and now there just so happened to be rain hitting all the areas where the fires are? 

I won't be surprised in the future if they claim that aliens are attacking us with lasers.

Agenda 2030 is what is happening and it will not stop. Anyone with any sense should understand that 5g milimeter waves are weapons. Radio emf are ised as weapons. How do you think the public would feel if companies put forth 500 billion to install machine guns from towers every 1000 feet claiming it is for faster internet. People are willfully ignorant and refuse to believe. 

Nobody here must pay any attention to radar. Pretty easy to see that weather patterns are always very strange. Floods are about to hit cali because they are letting it happen.
Ionispheroc heaters, smart meters, DEW, ect... and now there just so happened to be rain hitting all the areas where the fires are? 

I won't be surprised in the future if they claim that aliens are attacking us with l

In [185]:
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      ini_bs.get_text() )  # The text to search

# Drop any line that starts with a URL (raw string avoids invalid-escape warnings)
link = re.sub(r'(^https?://.*[\r\n]*)', '', ini_bs.get_text(), flags=re.MULTILINE)

print(letters_only)
print(link)

Ionispheroc heaters  smart meters  DEW  ect    and now there just so happened to be rain hitting all the areas where the fires are    I won t be surprised in the future if they claim that aliens are attacking us with lasers   Agenda      is what is happening and it will not stop  Anyone with any sense should understand that  g milimeter waves are weapons  Radio emf are ised as weapons  How do you think the public would feel if companies put forth     billion to install machine guns from towers every      feet claiming it is for faster internet  People are willfully ignorant and refuse to believe    Nobody here must pay any attention to radar  Pretty easy to see that weather patterns are always very strange  Floods are about to hit cali because they are letting it happen 
Ionispheroc heaters, smart meters, DEW, ect... and now there just so happened to be rain hitting all the areas where the fires are? 

I won't be surprised in the future if they claim that aliens are attacking us with l
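
The link pattern used above is anchored with `^`, so under `re.MULTILINE` it only removes URLs that begin a line. The sketch below uses the standard-library `re` module and made-up sample text to contrast it with an unanchored pattern, `https?://\S+`, which is an alternative of my own and not something the notebook uses.

```python
import re

text = "see this\nhttps://example.com/a\nand inline https://example.com/b here"

# Anchored pattern from the notebook: removes whole lines that start with a URL
line_start_only = re.sub(r'^https?://.*[\r\n]*', '', text, flags=re.MULTILINE)

# Unanchored alternative (assumption): removes URLs wherever they appear
anywhere = re.sub(r'https?://\S+', '', text)

print(repr(line_start_only))  # 'see this\nand inline https://example.com/b here'
print(repr(anywhere))         # 'see this\n\nand inline  here'
```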

In [186]:
lower_case = letters_only.lower()
words = lower_case.split()

In [187]:
print(lower_case)
print(words)

ionispheroc heaters  smart meters  dew  ect    and now there just so happened to be rain hitting all the areas where the fires are    i won t be surprised in the future if they claim that aliens are attacking us with lasers   agenda      is what is happening and it will not stop  anyone with any sense should understand that  g milimeter waves are weapons  radio emf are ised as weapons  how do you think the public would feel if companies put forth     billion to install machine guns from towers every      feet claiming it is for faster internet  people are willfully ignorant and refuse to believe    nobody here must pay any attention to radar  pretty easy to see that weather patterns are always very strange  floods are about to hit cali because they are letting it happen 
['ionispheroc', 'heaters', 'smart', 'meters', 'dew', 'ect', 'and', 'now', 'there', 'just', 'so', 'happened', 'to', 'be', 'rain', 'hitting', 'all', 'the', 'areas', 'where', 'the', 'fires', 'are', 'i', 'won', 't', 'be', 

In [188]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [189]:
# Build the stop-word set once instead of reloading the list for every word
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]

In [190]:
print(words)

['ionispheroc', 'heaters', 'smart', 'meters', 'dew', 'ect', 'happened', 'rain', 'hitting', 'areas', 'fires', 'surprised', 'future', 'claim', 'aliens', 'attacking', 'us', 'lasers', 'agenda', 'happening', 'stop', 'anyone', 'sense', 'understand', 'g', 'milimeter', 'waves', 'weapons', 'radio', 'emf', 'ised', 'weapons', 'think', 'public', 'would', 'feel', 'companies', 'put', 'forth', 'billion', 'install', 'machine', 'guns', 'towers', 'every', 'feet', 'claiming', 'faster', 'internet', 'people', 'willfully', 'ignorant', 'refuse', 'believe', 'nobody', 'must', 'pay', 'attention', 'radar', 'pretty', 'easy', 'see', 'weather', 'patterns', 'always', 'strange', 'floods', 'hit', 'cali', 'letting', 'happen']
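
The filtering step can be illustrated without NLTK. In this sketch the `stops` set is a small hand-picked stand-in for `stopwords.words("english")`, chosen only for the example:

```python
words = ["i", "won", "t", "be", "surprised", "in", "the", "future"]

# Stand-in stop-word set (assumption); a set gives fast membership tests
stops = {"i", "won", "t", "be", "in", "the"}

meaningful = [w for w in words if w not in stops]
print(meaningful)  # ['surprised', 'future']
```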


In [191]:
def selftext_to_words(raw_selftext):
    # Function to convert a raw selftext to a string of words
    # The input is a single string (one post's raw selftext), and
    # the output is a single string (the preprocessed selftext)
    # 1. Remove HTML
    selftext_text = BeautifulSoup(raw_selftext, 'html.parser').get_text()
    #
    # 2. Remove lines that start with a URL, then remove non-letters
    #    (previously the link-stripped text was computed but never used)
    no_links = re.sub(r'(^https?://.*[\r\n]*)', '', selftext_text, flags=re.MULTILINE)
    letters_only = re.sub("[^a-zA-Z]", " ", no_links)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words('english'))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)

In [192]:
# Get the number of selftext posts based on the dataframe size.
total_selftext = combined.shape[0]
print(total_selftext)

# Initialize an empty list to hold the cleaned selftext.
clean_selftext = []

3157


In [193]:
print("Cleaning and parsing the selftext set of combined data...")

for j, selftext in enumerate(combined['selftext']):
    # Clean the post's selftext, then append it to clean_selftext.
    clean_selftext.append(selftext_to_words(selftext))

    # Print a progress message every 1000 posts
    if (j + 1) % 1000 == 0:
        print(f'Selftext {j+1} of {total_selftext}.')


Cleaning and parsing the selftext set of combined data...

https://youtu.be/Xr0jzB7iSOk?t=909" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup

Selftext 1000 of 3157.

Selftext 2000 of 3157.

Selftext 3000 of 3157.

In [71]:
cv = CountVectorizer()

In [72]:
X = cv.fit_transform(combined['selftext'])

In [73]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,00,000,00000aab0f6b,00001,00002,00004,00027,001,002,0064,...,zyklon,zziff54eruh11,zzzzzzzzzzzzzzz,ˈheɪdriən,ˈænəˌmi,виноват,изба,плоха,топор,что
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
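
The table above is a document-term matrix: one row per post, one column per token, and each cell a count. The plain-Python sketch below mimics `CountVectorizer`'s defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`) on two made-up documents. It is an illustration of the idea, not scikit-learn's implementation. Because `\w` also matches digits, tokens such as `00` and `000` become columns when the raw text is vectorized.

```python
import re
from collections import Counter

docs = ["Aliens attack radar", "Radar sees rain, rain"]

# Same default token pattern as CountVectorizer: runs of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: every distinct token, in sorted order (one column each)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row of counts per document
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['aliens', 'attack', 'radar', 'rain', 'sees']
print(matrix)      # [[1, 1, 1, 0, 0], [0, 0, 1, 2, 1]]
```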


In [74]:
# Tokenize each title; do not reassign cv, which holds the fitted CountVectorizer
combined['combined_title_tokens'] = combined['title'].apply(nltk.word_tokenize)