#### CMM705 Big Data Programming Coursework (Sep 2019)

## Airbnb Singapore Machine Learning Model

Multiclass classification using PySpark and MLlib to predict the neighbourhood group of a listing from its latitude and longitude features. The string label column is converted to a numeric index, and the input columns are assembled into a feature vector.

#### 01. Exploring The Data

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
df = spark.read.csv('FileStore/tables/listings.csv', header = True, inferSchema = True)
df.printSchema()

totalCount = df.count()

**Input variables:** latitude, longitude

**Output variable:** neighbourhood group

In [6]:
# select input variables and output variables only
df = df.select('latitude','longitude', 'neighbourhood_group')
cols = df.columns
df.printSchema()

In [7]:
# convert str to double
df = df.withColumn('latitude',df['latitude'].cast("double")).withColumn('longitude',df['longitude'].cast("double"))

In [8]:
# drop rows with null values and count how many rows were removed
df = df.dropna()
nullValuesCount = totalCount - df.count()
nullValuesCount

In [9]:
display(df)

latitude,longitude,neighbourhood_group
1.44255,103.7958,North Region
1.33235,103.78521,Central Region
1.44246,103.79667,North Region
1.34541,103.95712,East Region
1.34567,103.95963,East Region
1.34702,103.96103,East Region
1.34348,103.96337,East Region
1.32304,103.91363,East Region
1.32458,103.91163,East Region
1.32461,103.91191,East Region


In [10]:
grouping_data = df.groupby('neighbourhood_group').count()
display(grouping_data)

neighbourhood_group,count
West Region,539
Central Region,6301
North Region,203
East Region,508
North-East Region,344
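
The counts above show a heavily imbalanced label: Central Region holds most of the listings. A quick check, using only the counts printed above, gives the accuracy a model would reach by always predicting the largest class, which is a useful baseline for the classifiers trained later. The names `region_counts` and `majority_baseline` are illustrative and not part of the original notebook.

```python
# Counts copied from the groupby output above.
region_counts = {
    "West Region": 539,
    "Central Region": 6301,
    "North Region": 203,
    "East Region": 508,
    "North-East Region": 344,
}

def majority_baseline(counts):
    """Return (largest class, share of all rows that fall in that class)."""
    total = sum(counts.values())
    top_class = max(counts, key=counts.get)
    return top_class, counts[top_class] / total

top_class, share = majority_baseline(region_counts)
print(f"{top_class}: {share:.3f}")  # Central Region: 0.798
```

A model that scores close to this share on the test set has learned little beyond the class frequencies.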


#### 02. Preparing Data for Machine Learning

In [12]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

stages = []

label_stringIdx = StringIndexer(inputCol = 'neighbourhood_group', outputCol = 'label')
stages += [label_stringIdx]

assemblerInputs = ['latitude', 'longitude']
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]


In [13]:
# Pipeline
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
df.printSchema()

In [14]:
import pandas as pd
pd.DataFrame(df.take(5), columns=df.columns).transpose()

Unnamed: 0,0,1,2,3,4
label,4,0,4,2,2
features,"[1.44255, 103.7958]","[1.33235, 103.78521]","[1.44246, 103.79667]","[1.34541, 103.95712]","[1.34567, 103.95963]"
latitude,1.44255,1.33235,1.44246,1.34541,1.34567
longitude,103.796,103.785,103.797,103.957,103.96
neighbourhood_group,North Region,Central Region,North Region,East Region,East Region
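
The `label` values in the output above follow from how `StringIndexer` orders categories: with its default `stringOrderType` of `frequencyDesc`, the most frequent string receives index 0. The sketch below mimics that rule in plain Python using the class counts printed earlier. It illustrates the ordering only and is not Spark's implementation; `frequency_desc_index` is a hypothetical helper.

```python
def frequency_desc_index(counts):
    """Map each category to a float index, most frequent category first."""
    # Sort by descending count. Ties are broken alphabetically here, which is
    # an assumption that does not matter because no two counts are equal.
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: float(i) for i, name in enumerate(ordered)}

# Counts copied from the groupby output earlier in the notebook.
region_counts = {
    "West Region": 539,
    "Central Region": 6301,
    "North Region": 203,
    "East Region": 508,
    "North-East Region": 344,
}

label_of = frequency_desc_index(region_counts)
for name, idx in sorted(label_of.items(), key=lambda kv: kv[1]):
    print(idx, name)
```

The resulting mapping agrees with the rows shown above: Central Region maps to 0, East Region to 2, and North Region to 4.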


In [15]:
# split data for testing and training

train, test = df.randomSplit([0.8, 0.2], seed = 2018)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

#### 03. Use the Decision Tree Classifier

In [17]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
dtModel = dt.fit(train)
predictions = dtModel.transform(test)
predictions.select('latitude', 'longitude', 'label', 'rawPrediction', 'prediction', 'probability').show(10)
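
A decision tree chooses each split to reduce node impurity, and Spark's `DecisionTreeClassifier` uses Gini impurity unless the `impurity` parameter is set otherwise. As a rough illustration, the snippet below computes the Gini impurity of a node that holds every listing. It uses the class counts of the full dataset rather than the training split, which is a simplifying assumption, and `gini` is an illustrative helper.

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts from the groupby output (full dataset, not the training split).
root_counts = [6301, 539, 508, 344, 203]
print("Impurity of the unsplit data:", gini(root_counts))

# A node containing a single class is pure, so its impurity is zero.
print("Impurity of a pure node:", gini([10, 0, 0, 0, 0]))
```

With `maxDepth = 3` the tree can make at most three successive threshold splits on latitude or longitude along any path, so it ends with at most eight leaf nodes.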

In [18]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# The label has five classes, so use a multiclass metric instead of binary ROC
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Test Accuracy: " + str(evaluator.evaluate(predictions)))
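
Because the label has five classes, overall accuracy and per-class recall are more informative here than a single binary ROC score. The plain-Python sketch below shows how both are computed from `(label, prediction)` pairs like those held in the `predictions` DataFrame. The sample pairs are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of rows whose prediction equals the label."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def recall_per_class(pairs):
    """For each true label, the fraction of its rows predicted correctly."""
    totals = Counter(label for label, _ in pairs)
    hits = Counter(label for label, pred in pairs if label == pred)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical (label, prediction) pairs, dominated by class 0.0.
pairs = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 2.0), (4.0, 0.0)]

print("accuracy:", accuracy(pairs))
print("recall per class:", recall_per_class(pairs))
```

In this toy sample, accuracy looks moderate while classes 1.0 and 4.0 are never predicted correctly, which is the pattern to watch for when one region dominates the data.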

#### 04. Use the Logistic Regression Model

In [20]:
from pyspark.ml.classification import LogisticRegression

# Use the multinomial family because the label has more than two classes
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

# Fit the model
mlrModel = mlr.fit(train)
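
The `regParam` and `elasticNetParam` settings above combine an L1 and an L2 penalty. Following the formula given in the Spark ML documentation, the term added to the loss is `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The sketch below evaluates that expression for `regParam=0.3` and `elasticNetParam=0.8`; the weight vector is hypothetical and `elastic_net_penalty` is an illustrative helper.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic net penalty: a weighted mix of the L1 norm and squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )

weights = [0.5, -2.0]  # hypothetical coefficients, not taken from the fitted model

penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)
print("penalty with the notebook's settings:", penalty)

# elastic_net_param = 1.0 gives a pure L1 (lasso) penalty, 0.0 a pure L2 (ridge) penalty.
print("pure L1:", elastic_net_penalty(weights, 0.3, 1.0))
print("pure L2:", elastic_net_penalty(weights, 0.3, 0.0))
```

With `elasticNetParam=0.8` most of the penalty comes from the L1 term, which tends to push coefficients to exactly zero.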

In [21]:
# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))
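
For a five-class problem with two features, `coefficientMatrix` has one row per class (5 x 2) and `interceptVector` has one entry per class. The model scores each class as a linear function of the features and converts the scores to probabilities with a softmax. The sketch below walks through that computation with made-up coefficients; they are not the values printed by this notebook.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(coefficients, intercepts, features):
    """Score each class as (row of coefficients . features) + intercept, then softmax."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(coefficients, intercepts)
    ]
    return softmax(scores)

# Hypothetical 5 x 2 coefficient matrix and intercept vector.
coefficients = [[0.0, 0.0], [0.0, -0.1], [0.0, 0.1], [1.0, 0.0], [-1.0, 0.0]]
intercepts = [1.0, 0.0, 0.0, 0.0, 0.0]

features = [1.34541, 103.95712]  # latitude, longitude of one listing
probabilities = predict_proba(coefficients, intercepts, features)
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
```

The index of the largest probability plays the role of the `prediction` column, and the probability list corresponds to the `probability` column.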

In [22]:
# Make predictions on the test set

predictions = mlrModel.transform(test)
predictions.select('latitude', 'longitude', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

In [23]:
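# Evaluate the logistic regression model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# The label has five classes, so use a multiclass metric instead of binary ROC
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print('Test Accuracy', evaluator.evaluate(predictions))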

Because the prediction accuracy was low, I trained multiple models and compared them to find the most accurate one. The Logistic Regression model was selected for this solution.