# SUV Prediction

An automotive comapny has launched a new SUV in the market. Exploring the previous data about the sales of their SUV's such as Gender, Age, Estimated Salary, and whether someone purchased or not, the company wants to predict the people who might be intereted to purchase new model.

In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

%matplotlib inline

In [2]:
def load_file(file):
    return pd.read_csv(file)

In [3]:
dataset_file = "C:\\Users\\mauli\\OneDrive\\Desktop\\DS\\Projects\\DS Projects\\SUV Prediction\\SUV Prediction.csv"

In [4]:
suv_df = load_file(dataset_file)
suv_df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [5]:
suv_df.isnull().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

Here, the dataset does not contain any null values.

In [6]:
suv_df.drop('User ID', inplace=True, axis=1)

In [7]:
suv_df.head()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
0,Male,19,19000,0
1,Male,35,20000,0
2,Female,26,43000,0
3,Female,27,57000,0
4,Male,19,76000,0


In [8]:
gender = pd.get_dummies(suv_df['Gender'], drop_first=True)

In [9]:
suv_df = pd.concat([suv_df, gender], axis=1)
suv_df.head()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased,Male
0,Male,19,19000,0,1
1,Male,35,20000,0,1
2,Female,26,43000,0,0
3,Female,27,57000,0,0
4,Male,19,76000,0,1


In [10]:
suv_df.drop('Gender', axis=1, inplace=True)
suv_df.head()

Unnamed: 0,Age,EstimatedSalary,Purchased,Male
0,19,19000,0,1
1,35,20000,0,1
2,26,43000,0,0
3,27,57000,0,0
4,19,76000,0,1


Now, the dataset is ready for train and test.

### Split data into train and test data set

In [11]:
X = suv_df.drop('Purchased', axis=1)
y = suv_df['Purchased']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

### Logistic Regression

In [14]:
LR = LogisticRegression(random_state=42)

In [15]:
LR.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
y_pred = LR.predict(X_test)

In [17]:
score = accuracy_score(y_test, y_pred)

In [18]:
score_percent = score * 100
score_percent

87.5

### Random Forest

In [19]:
rf = RandomForestClassifier(max_depth=50, n_estimators=30, max_leaf_nodes=12, random_state=42)

In [20]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=50, max_features='auto', max_leaf_nodes=12,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [21]:
y_rf = rf.predict(X_test)

In [22]:
acc = accuracy_score(y_test, y_rf)
perc_score = acc * 100
perc_score

92.5

Hence, Random Forest is chosen for this model.