###Data Flow

# ![Data Flow](files/tables/dataflow.jpg)

###Flow of the Notebook

1. About Dataset
2. Reading Data
3. Data Visualisation
4. Feature Engineering
5. Creating Machine Learning Models - Decision Tree, Random Forest, Naive Bayes
6. Model Comparison for Classification models
7. Business Conclusion

###About Dataset

#####Overview

Bob has started his own mobile company and wants to compete with big companies such as Apple and Samsung.
He does not know how to estimate the price of the mobiles his company makes, and in this competitive market he cannot simply rely on assumptions. To solve this problem, he collects sales data for mobile phones from various companies.
Bob wants to find a relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. However, he is not so good at machine learning, so he needs your help to solve this problem.

#####Metadata

1. id - id
2. battery_power - Total energy a battery can store in one time measured in mAh
3. blue - Has bluetooth or not
4. clock_speed - speed at which microprocessor executes instructions
5. dual_sim - Has dual sim support or not
6. fc - Front Camera mega pixels
7. four_g - Has 4G or not
8. int_memory - Internal Memory in Gigabytes
9. m_dep - Mobile Depth in cm
10. mobile_wt - Weight of mobile phone
11. n_cores - Number of cores of processor
12. pc - Primary Camera mega pixels
13. px_height - Pixel Resolution Height
14. px_width - Pixel Resolution Width
15. ram - Random Access Memory in Megabytes
16. sc_h - Screen Height of mobile in cm
17. sc_w - Screen Width of mobile in cm
18. talk_time - longest time that a single battery charge will last when you are
19. three_g - Has 3G or not
20. touch_screen - Has touch screen or not
21. wifi - Has wifi or not

dataset source - https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification

In [0]:
#importing required libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler,StringIndexer,StandardScaler,Normalizer
from pyspark.ml.classification import DecisionTreeClassifier,RandomForestClassifier,NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn.metrics import confusion_matrix

### Reading Data

In [0]:
# File location and type
file_location = "/FileStore/tables/mobile_prices_train.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
 
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1,1
1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0,2
563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0,2
615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0,2
1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0,1
1859,0,0.5,1,3,0,22,0.7,164,1,7,1004,1654,1067,17,1,10,1,0,0,1
1821,0,1.7,0,4,1,10,0.8,139,8,10,381,1018,3220,13,8,18,1,0,1,3
1954,0,0.5,1,0,0,24,0.8,187,4,0,512,1149,700,16,3,5,1,1,1,0
1445,1,0.5,0,0,0,53,0.7,174,7,14,386,836,1099,17,1,20,1,0,0,0
509,1,0.6,1,2,1,9,0.1,93,5,15,1137,1224,513,19,10,12,1,0,0,0


In [0]:
#creating temporary local table
temp_table_name = "mobile_prices"
df.createOrReplaceTempView(temp_table_name)
df.printSchema()

root
 |-- battery_power: integer (nullable = true)
 |-- blue: integer (nullable = true)
 |-- clock_speed: double (nullable = true)
 |-- dual_sim: integer (nullable = true)
 |-- fc: integer (nullable = true)
 |-- four_g: integer (nullable = true)
 |-- int_memory: integer (nullable = true)
 |-- m_dep: double (nullable = true)
 |-- mobile_wt: integer (nullable = true)
 |-- n_cores: integer (nullable = true)
 |-- pc: integer (nullable = true)
 |-- px_height: integer (nullable = true)
 |-- px_width: integer (nullable = true)
 |-- ram: integer (nullable = true)
 |-- sc_h: integer (nullable = true)
 |-- sc_w: integer (nullable = true)
 |-- talk_time: integer (nullable = true)
 |-- three_g: integer (nullable = true)
 |-- touch_screen: integer (nullable = true)
 |-- wifi: integer (nullable = true)
 |-- price_range: integer (nullable = true)



### Data Visualization

In [0]:
df.groupBy('price_range').count().display()

price_range,count
1,500
3,500
2,500
0,500


Sanity check: the data is not biased towards any class. The dataset is equally distributed across all four price ranges, with 500 rows each.
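
The balance check can also be expressed as a number. The following is a minimal sketch in plain Python, outside Spark, that takes the class counts from the output above and computes each class's share and the ratio between the largest and the smallest class. The helper name `class_balance` is hypothetical and not part of the notebook.

```python
def class_balance(counts):
    """Return per-class shares and the max/min imbalance ratio.

    `counts` maps a class label to its row count. Hypothetical helper,
    written only to illustrate the sanity check.
    """
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


# Counts copied from the groupBy('price_range').count() output
price_range_counts = {0: 500, 1: 500, 2: 500, 3: 500}
shares, ratio = class_balance(price_range_counts)
print(shares)  # every class holds a quarter of the rows
print(ratio)   # a ratio of 1.0 means the classes are perfectly balanced
```

A ratio well above 1 would suggest stratified sampling or class weights before training; here neither is needed.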

In [0]:
df.groupBy('price_range').avg('battery_power').orderBy('price_range').display()

price_range,avg(battery_power)
0,1116.902
1,1228.868
2,1228.32
3,1379.984


In [0]:
df.groupBy('price_range').avg('ram').orderBy('price_range').display()

price_range,avg(ram)
0,785.314
1,1679.49
2,2582.816
3,3449.232


In [0]:
display(df.select('n_cores','clock_speed'))

n_cores,clock_speed
2,2.2
3,0.5
5,0.5
6,2.5
2,1.2
1,0.5
8,1.7
4,0.5
7,0.5
5,0.6


In [0]:
df.groupBy('price_range').avg('pc').orderBy('price_range').display()

price_range,avg(pc)
0,9.574
1,9.924
2,10.018
3,10.15


In [0]:
df.groupBy('price_range').avg('int_memory').display()

price_range,avg(int_memory)
1,32.116
3,33.976
2,30.92
0,31.174


In [0]:
# this function will take the dataframe with averages for each prediction and will check if the average ram falls in the right group.
# For price range 0, avg. ram is below 1000
# For price range 1, avg ram is below 2000
# For price range 2, avg. ram is below 3000
# Else it will be price range 3
def sanityCheckRam(df,columnName):
    return df.withColumn('Sanity',F.when(F.col(columnName)<1000,0).when(F.col(columnName)<2000,1).when(F.col(columnName)<3000,2).otherwise(3))

In [0]:
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=['battery_power','clock_speed', 'fc','int_memory','m_dep','mobile_wt','n_cores','pc','px_height',
                                      'px_width','ram','sc_h','sc_w','talk_time'], outputCol=vector_col)
myGraph_vector = assembler.transform(df).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col).collect()[0][0]
corrmatrix = matrix.toArray().tolist()
cf = spark.createDataFrame(corrmatrix,['battery_power','clock_speed', 'fc','int_memory','m_dep','mobile_wt','n_cores','pc','px_height',
                                      'px_width','ram','sc_h','sc_w','talk_time'])
cf.display()



battery_power,clock_speed,fc,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time
1.0,0.0114815937415474,0.0333343850091244,-0.0040036799108913,0.0340845149028673,0.0018443762145708,-0.0297272589107314,0.0314406772433109,0.0149008045446747,-0.0084018304990439,-0.0006529264469275979,-0.0299586014679883,-0.0214209432942237,0.0525103546575949
0.0114815937415474,1.0,-0.0004338982076606422,0.0065451451529243,-0.0143643603092998,0.0123497478665861,-0.0057242260337928,-0.0052450375615914,-0.0145228950868494,-0.0094756530504301,0.003443030688089,-0.0290776218063468,-0.0073783555432141,-0.0114318655598128
0.0333343850091244,-0.0004338982076606422,1.0,-0.0291327752451993,-0.0017911308477027,0.0236180170968171,-0.0133562434429675,0.6445952827956346,-0.0099899074836979,-0.0051756343844266,0.0150989704984773,-0.0110139546950934,-0.0123725913079929,-0.0068286493669754
-0.0040036799108913,0.0065451451529243,-0.0291327752451993,1.0,0.0068857551509572,-0.0342142091354753,-0.0283104166898515,-0.0332733888548644,0.0104412567756179,-0.0083348533723946,0.0328131740809153,0.0377711327973655,0.0117305344103692,-0.002790288991187
0.0340845149028673,-0.0143643603092998,-0.0017911308477027,0.0068857551509572,1.0,0.021756065251581,-0.0035038859268238,0.0262824394440505,0.0252628698603812,0.0235663459060418,-0.0094341206470101,-0.0253478057854241,-0.018388117499988,0.0170025529505586
0.0018443762145708,0.0123497478665861,0.0236180170968171,-0.0342142091354753,0.021756065251581,1.0,-0.0189887572050492,0.0188439298352361,0.000939323980448404,8.976163925566415e-05,-0.0025805417072914,-0.0338547136814373,-0.0207605250887393,0.0062085007585923
-0.0297272589107314,-0.0057242260337928,-0.0133562434429675,-0.0283104166898515,-0.0035038859268238,-0.0189887572050492,1.0,-0.0011926142129104,-0.0068720733908025,0.0244799170247615,0.0048683257447985,-0.00031483547910172484,0.0258264541680811,0.0131478639078402
0.0314406772433109,-0.0052450375615914,0.6445952827956346,-0.0332733888548644,0.0262824394440505,0.0188439298352361,-0.0011926142129104,1.0,-0.0184654942783753,0.004195944306608,0.0289835165244201,0.0049375198644699,-0.0238192379203956,0.0146569851601641
0.0149008045446747,-0.0145228950868494,-0.0099899074836979,0.0104412567756179,0.0252628698603812,0.000939323980448404,-0.0068720733908025,-0.0184654942783753,1.0,0.5106644191393128,-0.020351922754965,0.0596152944608767,0.0430382681055888,-0.0106452534670933
-0.0084018304990439,-0.0094756530504301,-0.0051756343844266,-0.0083348533723946,0.0235663459060418,8.976163925566415e-05,0.0244799170247615,0.004195944306608,0.5106644191393128,1.0,0.0041052164672186,0.0215985621119942,0.0346991955857733,0.006719940879503


Since there is a high correlation between px_height and px_width (about 0.51 in the matrix above) and between sc_h and sc_w, we won't use those columns directly. Instead, we combine each pair into a single feature using the formulas below:
1. pixel area (px_area) = px_height × px_width
2. screen area (sc_area) = sc_h × sc_w

Lastly, we will drop talk_time, as we treat it as a derivative of battery power.
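
As a worked example of the two formulas, the sketch below applies them in plain Python to the first two rows of the dataset shown earlier. The helper `add_area_features` is a hypothetical name used only for illustration; the notebook itself computes these columns with Spark.

```python
def add_area_features(row):
    """Return a copy of one phone record with the two derived area features."""
    out = dict(row)
    out["sc_area"] = row["sc_h"] * row["sc_w"]            # screen area
    out["px_area"] = row["px_height"] * row["px_width"]   # pixel area
    return out


# Values taken from the first two rows of the displayed dataset
sample_rows = [
    {"sc_h": 9, "sc_w": 7, "px_height": 20, "px_width": 756},
    {"sc_h": 17, "sc_w": 3, "px_height": 905, "px_width": 1988},
]
derived = [add_area_features(r) for r in sample_rows]
for d in derived:
    print(d["sc_area"], d["px_area"])
# prints:
# 63 15120
# 51 1799140
```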

### Feature Engineering

In [0]:
#dropping missing values
df=df.dropna()

#calculating screen area and pixel area based on the respective height and width columns
df=df.withColumn('sc_area',F.col('sc_h')*F.col('sc_w')).withColumn('px_area',F.col('px_height')*F.col('px_width'))

#combining four_g and three_g column into one column with granular values of 2G, 3G and 4G
df = df.withColumn('network',F.when(F.col('four_g')==1,'4G').when(F.col('three_g')==1 ,'3G').otherwise('2G'))

#Dropping the below columns
df = df.drop("four_g","three_g","px_height","px_width", "sc_h", "sc_w", "talk_time")

#displaying final dataset that would be used for creating model
df.display()

battery_power,blue,clock_speed,dual_sim,fc,int_memory,m_dep,mobile_wt,n_cores,pc,ram,touch_screen,wifi,price_range,sc_area,px_area,network
842,0,2.2,0,1,7,0.6,188,2,2,2549,0,1,1,63,15120,2G
1021,1,0.5,1,0,53,0.7,136,3,6,2631,1,0,2,51,1799140,4G
563,1,0.5,1,2,41,0.9,145,5,6,2603,1,0,2,22,2167308,4G
615,1,2.5,0,0,10,0.8,131,6,9,2769,0,0,2,128,2171776,3G
1821,1,1.2,0,13,44,0.6,141,2,14,1411,1,0,1,16,1464096,4G
1859,0,0.5,1,3,22,0.7,164,1,7,1067,0,0,1,17,1660616,3G
1821,0,1.7,0,4,10,0.8,139,8,10,3220,0,1,3,104,387858,4G
1954,0,0.5,1,0,24,0.8,187,4,0,700,1,1,0,48,588288,3G
1445,1,0.5,0,0,53,0.7,174,7,14,1099,0,0,0,17,322696,3G
509,1,0.6,1,2,9,0.1,93,5,15,513,0,0,0,190,1391688,4G


In [0]:
# Create a 70-30 train test split
train_data,test_data=df.randomSplit([0.7,0.3])
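
Without a seed, the split is drawn afresh on every run, so the test set size and all metrics reported later will vary between runs. PySpark's `randomSplit` accepts an optional `seed` argument for this purpose. The sketch below illustrates the idea in plain Python, outside Spark: each row is assigned to train or test by an independent random draw, and a fixed seed makes the assignment repeatable. The helper `seeded_split` and the seed value 42 are assumptions made for illustration.

```python
import random


def seeded_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with a fixed-seed random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Independent draw per row, so the split sizes are only approximately 70/30
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(2000))  # stand-in for the 2000 phone records
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)
print(len(train_a), len(test_a))
print(train_a == train_b)  # True: the same seed reproduces the same split
```

Because every row is sampled independently, the test set holds roughly, not exactly, 30% of the rows.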

###Creating Decision Tree

In [0]:
# Use StringIndexer to convert the categorical columns to hold numerical data
network_indexer = StringIndexer(inputCol='network',outputCol='network_index',handleInvalid='keep')

# Vector assembler is used to create a vector of input features
assembler = VectorAssembler(inputCols=['battery_power','blue','clock_speed','dual_sim','fc','int_memory','m_dep','mobile_wt','n_cores','pc','px_area',
                                       'sc_area','ram','touch_screen','wifi','network_index'],outputCol="features")

# Create an object for the Decision Tree model
# Set maxBins to a value equal to or greater than the number of categories in any single categorical feature
dt_model = DecisionTreeClassifier(labelCol='price_range',maxBins=5000)

# Pipeline passes the data through the indexer, the assembler and the model in sequence. It also ensures that the test data
# is pre-processed in the same way as the train data
pipe = Pipeline(stages=[network_indexer,assembler,dt_model])

fit_model=pipe.fit(train_data)
results = fit_model.transform(test_data)
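
By default, StringIndexer orders labels by descending frequency, so the most common `network` value receives index 0.0, and `handleInvalid='keep'` places labels unseen during fitting in one extra index. The sketch below mimics that behaviour in plain Python. The function names and the sample list of network values are hypothetical, and the alphabetical tie-break is an assumption that the sample data does not exercise.

```python
from collections import Counter


def fit_label_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; alphabetical order keeps ties deterministic
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_label(mapping, label):
    """Look up a label; unseen labels share one extra index (like 'keep')."""
    return mapping.get(label, float(len(mapping)))


# Illustrative values only, not the real column
networks = ["4G", "4G", "3G", "4G", "3G", "2G"]
mapping = fit_label_index(networks)
print(mapping)                         # {'4G': 0.0, '3G': 1.0, '2G': 2.0}
print(transform_label(mapping, "5G"))  # 3.0
```

Tree-based models treat the resulting index as just another numeric feature, so the frequency-based order carries no meaning of its own.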

In [0]:
results.select(['price_range','prediction']).show()

+-----------+----------+
|price_range|prediction|
+-----------+----------+
|          0|       0.0|
|          2|       2.0|
|          1|       0.0|
|          2|       2.0|
|          1|       1.0|
|          1|       1.0|
|          2|       2.0|
|          0|       0.0|
|          3|       3.0|
|          0|       0.0|
|          0|       0.0|
|          2|       3.0|
|          1|       1.0|
|          0|       1.0|
|          0|       0.0|
|          2|       2.0|
|          0|       0.0|
|          2|       2.0|
|          2|       2.0|
|          1|       1.0|
+-----------+----------+
only showing top 20 rows



In [0]:
results.groupby('prediction').count().sort('prediction').show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|  153|
|       1.0|  163|
|       2.0|  118|
|       3.0|  173|
+----------+-----+



#####Evaluating Decision Tree model

In [0]:
y_true = results.select("price_range")
y_true = y_true.toPandas()

y_pred = results.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix \n {}".format(cnf_matrix))

Below is the confusion matrix 
 [[141  11   0   0]
 [ 12 125  11   0]
 [  0  27  95  24]
 [  0   0  12 149]]


The confusion matrix shows that the model predicts most classes correctly, and that every error falls between adjacent price ranges. Classes 0 and 1 are confused least often (11 + 12 = 23 cases), while classes 1 and 2 are confused most often (11 + 27 = 38 cases).
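
The summary metrics can be derived directly from this matrix. The sketch below is a plain-Python illustration, not the evaluator's actual implementation: it computes accuracy, per-class precision and recall, and a support-weighted F1 score, assuming rows hold the true classes and columns the predicted classes (the scikit-learn convention) and that every class is predicted at least once. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(cm):
    """Accuracy, per-class precision/recall and weighted F1 from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    true_counts = [sum(cm[i]) for i in range(n)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # column sums
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = [cm[i][i] / pred_counts[i] for i in range(n)]
    recall = [cm[i][i] / true_counts[i] for i in range(n)]
    f1 = [2 * cm[i][i] / (true_counts[i] + pred_counts[i]) for i in range(n)]
    # Weight each class's F1 by the number of true samples in that class
    weighted_f1 = sum(f1[i] * true_counts[i] for i in range(n)) / total
    return accuracy, precision, recall, weighted_f1


dt_cm = [
    [141, 11, 0, 0],
    [12, 125, 11, 0],
    [0, 27, 95, 24],
    [0, 0, 12, 149],
]
accuracy, precision, recall, weighted_f1 = classification_metrics(dt_cm)
print(round(accuracy, 4))      # 0.8402
print(round(weighted_f1, 4))   # 0.8373
print(round(precision[0], 4))  # 0.9216
print(round(recall[0], 4))     # 0.9276
```

Reading per-class precision and recall side by side shows which price ranges drive the errors, which a single accuracy figure hides.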

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(labelCol="price_range", predictionCol="prediction", metricName="accuracy")
d_accuracy = ACC_evaluator.evaluate(results)
print("The accuracy of the decision tree classifier is {}".format(d_accuracy))

f1_evaluator = MulticlassClassificationEvaluator(labelCol="price_range", predictionCol="prediction", metricName="f1")
d_f1 = f1_evaluator.evaluate(results)
print("F1 Score is {}".format(d_f1))

# precisionByLabel and recallByLabel report the metric for a single class, selected by metricLabel (default 0.0, i.e. price range 0)
precision_evaluator = MulticlassClassificationEvaluator(labelCol="price_range", predictionCol="prediction", metricName="precisionByLabel")
d_precision = precision_evaluator.evaluate(results)
print("Precision is {}".format(d_precision))

recall_evaluator = MulticlassClassificationEvaluator(labelCol="price_range", predictionCol="prediction", metricName="recallByLabel")
d_recall = recall_evaluator.evaluate(results)
print("Recall is {}".format(d_recall))

The accuracy of the decision tree classifier is 0.8401976935749588
F1 Score is 0.8372837399898917
Precision is 0.9215686274509803
Recall is 0.9276315789473685


In [0]:
sanityCheckRam(results.groupBy('prediction').agg(F.avg('ram').alias('ramAvg')),'ramAvg').display()

prediction,ramAvg,Sanity
0.0,805.8823529411765,0
1.0,1754.840490797546,1
3.0,3417.791907514451,3
2.0,2534.906779661017,2


According to the sanity check function defined above, an average ram below 1000 maps to class 0, between 1000 and 2000 to class 1, between 2000 and 3000 to class 2, and anything higher to class 3. Passing the average ram of each predicted class to the function returns a Sanity value equal to the predicted class in every row, so the predictions are consistent with the ram trend seen in the data. This supports the plausibility of the model, although it is a coarse check rather than a proof of validity.
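
The same thresholds can be checked outside Spark. The sketch below is a plain-Python equivalent of the `F.when` chain in `sanityCheckRam`, applied to the average ram per predicted class from the output above (rounded to two decimals). The helper name `ram_bucket` is hypothetical.

```python
def ram_bucket(avg_ram):
    """Map an average ram value to the price range its threshold implies."""
    if avg_ram < 1000:
        return 0
    if avg_ram < 2000:
        return 1
    if avg_ram < 3000:
        return 2
    return 3


# Average ram per predicted class, copied from the decision tree output (rounded)
dt_avg_ram = {0: 805.88, 1: 1754.84, 2: 2534.91, 3: 3417.79}
matches = {cls: ram_bucket(avg) == cls for cls, avg in dt_avg_ram.items()}
print(matches)                # {0: True, 1: True, 2: True, 3: True}
print(all(matches.values()))  # True
```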

###Building Random Forest model

In [0]:
#creating random forest classifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'price_range')

pipe = Pipeline(stages=[network_indexer,assembler,rf])

fit_model=pipe.fit(train_data)

results2 = fit_model.transform(test_data)

In [0]:
results2.select(['price_range','prediction']).show()

+-----------+----------+
|price_range|prediction|
+-----------+----------+
|          0|       1.0|
|          2|       2.0|
|          1|       1.0|
|          2|       2.0|
|          1|       1.0|
|          1|       2.0|
|          2|       2.0|
|          0|       0.0|
|          3|       3.0|
|          0|       0.0|
|          0|       0.0|
|          2|       3.0|
|          1|       2.0|
|          0|       0.0|
|          0|       0.0|
|          2|       2.0|
|          0|       0.0|
|          2|       2.0|
|          2|       2.0|
|          1|       1.0|
+-----------+----------+
only showing top 20 rows



In [0]:
results2.groupby('prediction').count().sort('prediction').show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|  163|
|       1.0|  144|
|       2.0|  145|
|       3.0|  155|
+----------+-----+



#####Evaluating Random Forest model

In [0]:
y_true = results2.select("price_range")
y_true = y_true.toPandas()

y_pred = results2.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix \n {}".format(cnf_matrix))

Below is the confusion matrix 
 [[142  10   0   0]
 [ 21 103  24   0]
 [  0  31 102  13]
 [  0   0  19 142]]


The confusion matrix shows that the model predicts most classes correctly, and that every error falls between adjacent price ranges. Classes 0 and 1 are confused least often (10 + 21 = 31 cases), while classes 1 and 2 are confused most often (24 + 31 = 55 cases).
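
To compare how often two price ranges are mistaken for each other, the two off-diagonal cells of the pair can be added up. The sketch below does this for the random forest matrix above, again assuming rows are true classes and columns are predicted classes; `pair_confusion` is a hypothetical helper name.

```python
def pair_confusion(cm, a, b):
    """Count samples of class a predicted as b plus samples of class b predicted as a."""
    return cm[a][b] + cm[b][a]


rf_cm = [
    [142, 10, 0, 0],
    [21, 103, 24, 0],
    [0, 31, 102, 13],
    [0, 0, 19, 142],
]
adjacent = {(a, a + 1): pair_confusion(rf_cm, a, a + 1) for a in range(3)}
print(adjacent)  # {(0, 1): 31, (1, 2): 55, (2, 3): 32}
```

The middle price ranges are the hardest to separate, which is plausible because they border two neighbouring classes rather than one.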

In [0]:
# Values are printed rounded to four decimals
r_accuracy = ACC_evaluator.evaluate(results2)
print("The accuracy of the Random Forest classifier is {:.4f}".format(r_accuracy))

r_f1 = f1_evaluator.evaluate(results2)
print("F1 Score is {:.4f}".format(r_f1))

r_precision = precision_evaluator.evaluate(results2)
print("Precision is {:.4f}".format(r_precision))

# Make sure the evaluator reports recall (for label 0) rather than precision
r_recall = recall_evaluator.setMetricName("recallByLabel").evaluate(results2)
print("Recall is {:.4f}".format(r_recall))

The accuracy of the Random Forest classifier is 0.8056
F1 Score is 0.8048
Precision is 0.8712
Recall is 0.9342


In [0]:
sanityCheckRam(results2.groupBy('prediction').agg(F.avg('ram').alias('ramAvg')),'ramAvg').display()

prediction,ramAvg,Sanity
0.0,817.4110429447853,0
1.0,1685.8958333333333,1
3.0,3445.2129032258063,3
2.0,2687.731034482759,2


According to the sanity check function defined above, an average ram below 1000 maps to class 0, between 1000 and 2000 to class 1, between 2000 and 3000 to class 2, and anything higher to class 3. Passing the average ram of each class predicted by the random forest to the function again returns a Sanity value equal to the predicted class in every row, so these predictions are also consistent with the ram trend seen in the data.

### Building Naive Bayes Model

In [0]:
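#creating naive bayes classifier
#Note: nb reads its input from featuresCol="features", so the scaled_features and norm_features columns created below are computed but not consumed by the classifier; set featuresCol="norm_features" to train on them
nb = NaiveBayes(labelCol="price_range", featuresCol="features")
 
#Scaler and normalizer stages, intended to improve accuracy and fit of the model
#With its default settings (withMean=False, withStd=True) the StandardScaler scales each feature to unit standard deviation without centering it
scaler = StandardScaler(inputCol="features",outputCol="scaled_features")
 
#Normalize each feature vector to have unit norm using the given p-norm (here p=1)
normalizer = Normalizer(inputCol="scaled_features", outputCol="norm_features", p=1.0)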

pipeline = Pipeline(stages=[network_indexer,assembler,scaler,normalizer, nb])

nbModel = pipeline.fit(train_data)
results3 = nbModel.transform(test_data)

In [0]:
# Displaying predictions by the naive bayes model
results3.select("price_range", "prediction").show()

+-----------+----------+
|price_range|prediction|
+-----------+----------+
|          0|       1.0|
|          2|       1.0|
|          1|       1.0|
|          2|       1.0|
|          1|       2.0|
|          1|       3.0|
|          2|       2.0|
|          0|       2.0|
|          3|       1.0|
|          0|       1.0|
|          0|       1.0|
|          2|       2.0|
|          1|       2.0|
|          0|       2.0|
|          0|       2.0|
|          2|       3.0|
|          0|       2.0|
|          2|       2.0|
|          2|       3.0|
|          1|       1.0|
+-----------+----------+
only showing top 20 rows



####Evaluating Naive Bayes Model

In [0]:
y_true = results3.select("price_range")
y_true = y_true.toPandas()

y_pred = results3.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix \n {}".format(cnf_matrix))

Below is the confusion matrix 
 [[84 14 54  0]
 [18 40 84  6]
 [ 4 47 79 16]
 [ 0 59 75 27]]


The confusion matrix shows that the model's predictions are not very accurate. The misidentification count (false positives + false negatives) is 14 + 18 = 32 for classes 0 and 1, is highest for classes 1 and 2 at 84 + 47 = 131, and is 16 + 75 = 91 for classes 2 and 3. Unlike the tree-based models, Naive Bayes also confuses non-adjacent classes: 54 phones of class 0 are predicted as class 2, and 59 phones of class 3 are predicted as class 1.

In [0]:
n_accuracy = ACC_evaluator.evaluate(results3)
print("The accuracy of the Naive Bayes classifier is {}".format(n_accuracy))

n_f1 = f1_evaluator.evaluate(results3)
print("F1 Score is {}".format(n_f1))

n_precision = precision_evaluator.evaluate(results3)
print("Precision is {}".format(n_precision))

# Make sure the evaluator reports recall (for label 0) rather than precision
n_recall = recall_evaluator.setMetricName("recallByLabel").evaluate(results3)
print("Recall is {}".format(n_recall))

The accuracy of the Naive Bayes classifier is 0.37891268533772654
F1 Score is 0.3813590927417984
Precision is 0.7924528301886793
Recall is 0.5526315789473685


In [0]:
sanityCheckRam(results3.groupBy('prediction').agg(F.avg('ram').alias('ramAvg')),'ramAvg').display()

prediction,ramAvg,Sanity
0.0,678.1509433962265,0
1.0,2388.4125,2
3.0,3148.285714285714,3
2.0,2367.945205479452,2


According to the sanity check function defined above, an average ram below 1000 maps to class 0, between 1000 and 2000 to class 1, between 2000 and 3000 to class 2, and anything higher to class 3. Passing the average ram of each class predicted by Naive Bayes to the function reveals a mismatch: predicted class 1 has an average ram of about 2388, which the function maps to class 2. The Sanity value therefore does not match the prediction for that class, which indicates that the model fails to capture the relationship between ram and price range.

###Model Comparison for Classification models

In [0]:
spark = SparkSession.builder.getOrCreate()
 
columns = ['Model Name', 'Accuracy']
vals = [('Decision Tree', (d_accuracy*100)), ('Random Forest', (r_accuracy*100)),('Naive Bayes', (n_accuracy*100))]
 
df = spark.createDataFrame(vals, columns)
display(df)

Model Name,Accuracy
Decision Tree,84.01976935749587
Random Forest,80.56013179571664
Naive Bayes,37.89126853377265


#####After comparing all the models, we can see that the Naive Bayes model performs very poorly compared to the other two models.
#####Since the decision tree model gives the highest accuracy and F1 score, it can be used for deployment and real-world prediction.

###Business Conclusion

#####Based on our analysis, the mobile features used in model creation are important deciding factors when estimating the price range of a device.
#####Based on our statistical analysis, features like battery power, clock speed, internal memory, number of cores and ram are the most important features for prediction.
#####Any company entering the market can use this machine learning model to estimate the price range of its device based on these features.
#####This is a significant problem that machine learning can solve, and the industry can use the model to predict the price range of any device.