# Potential Innovative Data for Analysing Real Estate

To analyse real estate investment opportunities, we can draw on many kinds of data sources about the properties.

## Traditional data:
### 1. The variables about the Properties themselves
Location, size, property type (studio/apartment/house/other), floor area ratio, greening rate, facilities, debts/obligations, etc.

### 2. The variables about the External conditions
Neighborhood culture, good views, traffic accessibility, recreational centers, infrastructure, schools, hospitals, etc.

### 3. The variables about the Macroeconomic environment
Annual GDP per capita growth rate in the region, regional unemployment rate, etc.

## Innovative data:
Beyond these, newer data sources are now available for deeper analysis.
### 1. Real Estate Agent data
Real estate platforms such as Zillow, Airbnb and Redfin host many publicly available user comments, which we can extract and analyse for a given property.
### 2. Real Estate Advisory Report
Real estate consulting firms publish many reports analysing regional economies, real estate markets, and even individual commercial zones. We could extract the useful information from these reports for our analysis.
### 3. Social Media data
#### I) Comments of the properties themselves
We could scrape publicly available social media data (Yelp, Instagram, Facebook, Twitter, etc.) to dig into comments from people who may be potential or existing customers.
#### II) Comments of facilities in the neighborhood
Moreover, facilities in the neighborhood, such as a recreational center or a good view, can add a significant premium to a property. We could also analyse the comments on these facilities.

## Basic Application
In this notebook, I use 100 reviews of the **USC Hotel** scraped from Yelp to analyse its major pros and cons.

In [1]:
##import os
##os.chdir('D:\\text analytics')

In [2]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
# read_excel accepts no `encoding` argument in current pandas, so it is omitted here
data = pd.read_excel("USC_Hotel_100_comments.xlsx")
data.head()

Unnamed: 0,title,title_link,user-location,user-passport-stats,user-passport-stats1,review-content,rating-qualifier
0,Isabel N.,https://www.yelp.com/user_details?userid=FEtQg...,"Yonkers, NY",112 friends,25 reviews,"Overall hotel is okay. Pros - friendly staff, ...",3/15/2019
1,Amy H.,https://www.yelp.com/user_details?userid=k8vCC...,"Chicago, IL",187 friends,188 reviews,I was visiting USC and stayed at the USC hotel...,3/5/2019
2,Dong Z.,https://www.yelp.com/user_details?userid=b8sAd...,"University City, San Diego, CA",7 friends,55 reviews,People like a hotel for different reasons so m...,2/2/2019
3,Sara G.,https://www.yelp.com/user_details?userid=Zlo5S...,"Santa Maria, CA",15 friends,23 reviews,This review is based on the fact that this was...,10/14/2018
4,Kevin g.,https://www.yelp.com/user_details?userid=zivb-...,"Sparks, NV",0 friends,1 review,Outdated. $250 hold Lost cable tv during stay...,10/22/2018


### The dataset includes the names, locations and profile links of the people who reviewed the property, together with the text and date of each review.

In [3]:
# 68 unique locations among 100 reviews, so analysing the comments location by location is unlikely to be meaningful.
len(data['user-location'].unique()) 

68

We set the n-gram range (2- to 4-word phrases), the token pattern, `max_df` and the stop words so that uninformative words ('the', 'a', 'he', 'she') are excluded and the analysis stays simple.

In [4]:
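vectorizer = TfidfVectorizer(ngram_range=(2, 4),                  # 2- to 4-word phrases
                             token_pattern=r'\b[a-zA-Z]{3,}\b',   # alphabetic tokens of 3+ letters only
                             max_df=0.4,                          # drop phrases found in more than 40% of reviews
                             # Recent scikit-learn expects a list here, not a set. Punctuation entries
                             # are unnecessary because the token pattern never yields punctuation.
                             stop_words=stopwords.words('english'), binary=True)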
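
Phrases such as `room clean` and `across street` in the ranking further down look odd until one recalls that stop words are dropped before neighbouring tokens are joined into n-grams. The following is a minimal standard-library sketch of that order of operations, not scikit-learn's actual code. The `phrases` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for the vectorizer's analyzer and NLTK's full English list.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "was", "and", "from", "very"}
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")  # same pattern as in the notebook


def phrases(text, n_min=2, n_max=4):
    """Lowercase, tokenize, drop stop words, then join neighbours into n-grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


example = phrases("The room was clean and across the street from USC.")
print(example)
```

Because "was" and "the" are removed first, "room was clean" collapses to `room clean` and "across the street" to `across street`.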

In [5]:
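from nltk.stem import WordNetLemmatizer
# The NLTK tokenizer and WordNet resources must be downloaded once via nltk.download()
lemmatizer = WordNetLemmatizer()
reviews = []
for r in data['review-content']:
    token_words = nltk.word_tokenize(r)
    # lemmatize() treats each token as a noun by default, so plurals are reduced
    # ("rooms" -> "room") while verb forms such as "stayed" are not
    lemmatize_tokens = [lemmatizer.lemmatize(token) for token in token_words]
    reviews.append(" ".join(lemmatize_tokens))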

We use **TF-IDF** (Term Frequency-Inverse Document Frequency) to rank the phrases: each phrase's weights are summed over all reviews, and the phrases are sorted by that total.
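
To make the score concrete, here is a small standard-library sketch of the arithmetic. It assumes scikit-learn's default settings (smoothed IDF, L2 normalisation per document) together with `binary=True`. The three toy "reviews" are hypothetical, and the `max_df` filter is left out for brevity.

```python
import math

# Toy corpus: each review is already reduced to the set of phrases it contains
# (binary=True means a phrase counts once per review, however often it repeats).
docs = [
    {"front desk", "great location"},
    {"front desk", "room clean"},
    {"front desk", "great location", "walking distance"},
]
n_docs = len(docs)
vocab = sorted(set().union(*docs))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
df = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def weights(doc):
    """TF-IDF weights of one review, scaled to unit Euclidean length."""
    raw = {t: idf[t] for t in doc}  # binary tf = 1 for every phrase present
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


# Corpus-level score: sum of each phrase's weight over all reviews
score = {t: sum(weights(d).get(t, 0.0) for d in docs) for t in vocab}
for term, value in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{term:18s} {value:.3f}")
```

A phrase therefore scores highly when it appears in many reviews, while the per-review normalisation keeps long reviews from dominating the total.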

In [6]:
X = vectorizer.fit_transform(reviews)
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# Sum each phrase's TF-IDF weight over all reviews to get one score per phrase
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)
score = tf_idf.to_frame(name="score")
score.sort_values(by="score", ascending=False, inplace=True)

In [7]:
score[score['score']>0.3]

Unnamed: 0,score
front desk,0.769064
great location,0.556176
room clean,0.49248
walking distance,0.474433
parking structure,0.460798
usc campus,0.404201
next door,0.390182
right next,0.384471
customer service,0.378303
across street,0.37796


### Conclusion
In this basic application, the TF-IDF ranking of the USC Hotel reviews suggests that positive comments outnumber negative ones. The main reasons appear to be the great location (within walking distance of USC), clean rooms, friendly staff and comfortable beds.

### Further Analysis Suggestions
**About the properties themselves**: With more data, we could label the comments as good or bad and split them into training and testing datasets for a **classification analysis**, to further study the main reasons why properties perform well or badly.

**About the facilities in the neighborhood**: we can also analyse comments on nearby facilities (the University of Southern California, for example), assign each facility a score, and combine the scores of the different facilities in a **spatial analysis**.
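
One possible way to combine facility scores spatially is a distance-weighted average, sketched below with the standard library only. The coordinates, the scores, the `neighbourhood_score` helper and the exponential decay weighting are all hypothetical choices for illustration. A real analysis would need geocoded facilities and a justified weighting scheme.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def neighbourhood_score(prop, facilities, decay_km=1.0):
    """Distance-weighted mean of facility scores; weight = exp(-distance / decay_km)."""
    weights = [
        math.exp(-haversine_km(prop[0], prop[1], f["lat"], f["lon"]) / decay_km)
        for f in facilities
    ]
    return sum(w * f["score"] for w, f in zip(weights, facilities)) / sum(weights)


# Hypothetical coordinates and review-based scores in [0, 1]
property_location = (34.020, -118.280)
facilities = [
    {"name": "university campus", "lat": 34.022, "lon": -118.285, "score": 0.9},
    {"name": "recreation center", "lat": 34.030, "lon": -118.270, "score": 0.6},
    {"name": "shopping street", "lat": 34.050, "lon": -118.250, "score": 0.3},
]
combined = neighbourhood_score(property_location, facilities)
print(round(combined, 3))
```

Nearby facilities dominate the combined score, so a highly rated campus next door counts for more than a poorly rated street several kilometres away.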