# I. Input Data

The first step is to import the data we need. In this project, **pandas** is our main library for processing and manipulating data. The data itself is a phone specifications and price dataset obtained from Kaggle.

## About the Data

As mentioned earlier, the data comes from Kaggle. There are two files, **train.csv** and **test.csv**. train.csv contains all the features together with their price category. Note that we don't have the actual price here, only the category: it tells us whether a phone is cheap, intermediate, or expensive.

In [1]:
import pandas as pd

data = pd.read_csv("train.csv")
data.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1
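Before going further, it can help to check which price categories appear and how many rows fall into each. The sketch below is only an illustration: `sample` is a hypothetical mini-frame holding just the five `price_range` values visible in the output above, not the full dataset.

```python
import pandas as pd

# Hypothetical mini-frame: only the five price_range values shown by data.head()
sample = pd.DataFrame({"price_range": [1, 2, 2, 2, 1]})

# Count how many rows fall into each price category
class_counts = sample["price_range"].value_counts().sort_index()
print(class_counts)
```

On the real data, the same call would be `data["price_range"].value_counts()`.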


# II. Data Preprocessing

From the snippet above, we can see that all the features are numeric; there are no categorical columns apart from the target. So the only preprocessing we need is scaling, which we can apply straight away. Scikit-learn provides several classes for this.

This time, however, we will build our own class. It supports both min-max scaling and standard scaling.

In [2]:
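class scaler:
    """Stores per-column statistics and applies min-max or standard scaling."""
    def __init__(self, df):
        self.max = df.max()
        self.min = df.min()
        self.mean = df.mean()
        self.std = df.std()
    def minmax_scaler(self, x):
        # Rescale to [0, 1] using the stored min and max
        return (x - self.min) / (self.max - self.min)
    def standard_scaler(self, x):
        # Center on the mean and divide by the standard deviation
        return (x - self.mean) / self.std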
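For reference, min-max scaling maps each value to `(x - min) / (max - min)`, while standard scaling is conventionally `(x - mean) / std`. The sketch below works both out on a made-up three-value Series (the values are an assumption chosen so the results are easy to check by hand).

```python
import pandas as pd

# Made-up values, chosen so the results are easy to verify by hand
x = pd.Series([10.0, 20.0, 30.0])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: (x - mean) / std; pandas' std() uses the sample formula (ddof=1)
standard = (x - x.mean()) / x.std()

print(minmax.tolist())
print(standard.tolist())
```

Here the mean is 20 and the sample standard deviation is 10, so the standardized values come out as -1, 0 and 1.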

In [3]:
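# Every column except the last one (price_range) is a feature
cols = list(data.columns[:-1])
all_scalers = {}
scaled_data = pd.DataFrame()

# Fit one scaler per feature and store the min-max scaled column
for c in cols:
    all_scalers[c] = scaler(data[c])
    scaled_data[c] = all_scalers[c].minmax_scaler(data[c])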

In [4]:
scaled_data

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,0.227789,0.0,0.68,0.0,0.052632,0.0,0.080645,0.555556,0.900000,0.142857,0.10,0.010204,0.170895,0.612774,0.285714,0.388889,0.944444,0.0,0.0,1.0
1,0.347361,1.0,0.00,1.0,0.000000,1.0,0.822581,0.666667,0.466667,0.285714,0.30,0.461735,0.993324,0.634687,0.857143,0.166667,0.277778,1.0,1.0,0.0
2,0.041416,1.0,0.00,1.0,0.105263,1.0,0.629032,0.888889,0.541667,0.571429,0.30,0.644388,0.811749,0.627205,0.428571,0.111111,0.388889,1.0,1.0,0.0
3,0.076152,1.0,0.80,0.0,0.000000,0.0,0.129032,0.777778,0.425000,0.714286,0.45,0.620408,0.858478,0.671566,0.785714,0.444444,0.500000,1.0,0.0,0.0
4,0.881764,1.0,0.28,0.0,0.684211,1.0,0.677419,0.555556,0.508333,0.142857,0.70,0.616327,0.475300,0.308658,0.214286,0.111111,0.722222,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.195725,1.0,0.00,1.0,0.000000,1.0,0.000000,0.777778,0.216667,0.714286,0.70,0.623469,0.927904,0.110102,0.571429,0.222222,0.944444,1.0,1.0,0.0
1996,0.977956,1.0,0.84,1.0,0.000000,0.0,0.596774,0.111111,0.891667,0.428571,0.15,0.466837,0.977971,0.474613,0.428571,0.555556,0.777778,1.0,1.0,1.0
1997,0.941884,0.0,0.16,1.0,0.052632,1.0,0.548387,0.666667,0.233333,1.000000,0.15,0.442857,0.755674,0.748530,0.285714,0.055556,0.166667,1.0,1.0,0.0
1998,0.675351,0.0,0.16,0.0,0.210526,1.0,0.709677,0.000000,0.541667,0.571429,0.25,0.171429,0.113485,0.163816,0.928571,0.555556,0.944444,1.0,1.0,1.0
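As a sanity check, hand-rolled min-max scaling can be compared against Scikit-learn's `MinMaxScaler`. The sketch below assumes a small stand-in frame, `toy`, built from the `battery_power` and `ram` values of the first five rows shown earlier; it is not the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in frame: two feature columns from the first five rows of train.csv
toy = pd.DataFrame({
    "battery_power": [842, 1021, 563, 615, 1821],
    "ram": [2549, 2631, 2603, 2769, 1411],
})

# Hand-rolled min-max scaling, applied column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Scikit-learn's equivalent transformer
sk = MinMaxScaler().fit_transform(toy)

# True when both approaches agree up to floating-point tolerance
print(np.allclose(manual.to_numpy(), sk))
```

Note that a similar comparison for standard scaling would not match exactly: `StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas' `std()` defaults to the sample standard deviation (ddof=1).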


# III. Model Building

The last step is to build the classifier. For this project we choose a Random Forest. We use Scikit-learn to build and train it, and then joblib to save it. Our application will later load this saved model.

In [5]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier().fit(scaled_data, data['price_range'])

In [6]:
from sklearn.metrics import accuracy_score

# Note: this measures accuracy on the same data the model was trained on
predictions = rf_model.predict(scaled_data)
accuracy = accuracy_score(data['price_range'], predictions)

In [7]:
accuracy

1.0
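The score above is computed on the same rows the model was fit on, so it says little about unseen phones. One common way to get a more honest estimate is to hold out part of the data for validation. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `scaled_data` and `price_range`, and the 80/20 split, the three classes, and `random_state=0` are arbitrary choices, not the author's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and price_range labels
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Keep 20% of the rows aside; stratify so each class is represented in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Compare accuracy on the training rows with accuracy on the held-out rows
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(train_acc, val_acc)
```

With the real data, `X` and `y` would be replaced by `scaled_data` and `data['price_range']`.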

In [8]:
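import os
import joblib

# Make sure the target directory exists before saving
os.makedirs("model", exist_ok=True)
joblib.dump(rf_model, "model/rf_model.sav")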

['model/rf_model.sav']
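To use the saved model later, it can be read back with `joblib.load`. The sketch below shows the round trip under a few assumptions: it trains a small throwaway model on synthetic data and writes it to a temporary directory, rather than touching the real `model/rf_model.sav`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Throwaway model on synthetic data, standing in for rf_model
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.sav")
    joblib.dump(model, path)          # save, as in the cell above
    loaded_model = joblib.load(path)  # load, as an application would

# The reloaded model should give the same predictions as the original
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Any new input passed to the loaded model needs to be scaled the same way as the training features.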

# IV. Conclusions

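A Random Forest classifier has been built with Scikit-learn and saved with joblib. The model reaches 100% accuracy on the training data it was fit on; because it was not evaluated on held-out data, this figure does not measure how well it generalizes.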