# **Project Name**    - Mobile Price Range Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual

# **Project Summary -**


In the age of rapidly evolving technology, smartphones have become an integral part of our lives. As consumers have a wide range of options to choose from, predicting the price range of mobile phones accurately can provide valuable insights to manufacturers, retailers, and consumers alike. The "Mobile Price Range Prediction" project aims to develop predictive models that can estimate the price range of mobile phones based on their specifications and features.

The main objective of this project is to create accurate and robust machine learning models that can predict the price range of mobile phones based on attributes such as battery power, dual SIM support, RAM, storage, and other technical specifications.

The project follows a structured methodology that includes data collection, preprocessing, feature engineering, model selection, training, and evaluation. The dataset used for this project contains information about mobile phone features and their corresponding price ranges.

# **GitHub Link -**


https://github.com/padhilipika



# **Problem Statement**



In the competitive mobile phone market, companies want to understand mobile phone sales data and the factors that drive prices.
The objective is to find relationships between the features of a mobile phone (e.g., RAM, internal memory) and its selling price. In this problem, we do not have to predict the actual price but a price range indicating how high the price is.

#### **Define Your Business Objective?**

Answer Here.

Develop predictive models that accurately estimate the price range of mobile phones from their specifications, giving manufacturers and consumers valuable insights for informed decision-making in the smartphone market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('datamobilepricerange.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
# Dataset Last Look (viewing the last 5 rows)
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
print(f'number of rows : {df.shape[0]}  \nnumber of columns : {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
# Description of the data, transposed so that each feature is one row
df.describe().T

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

There are no duplicate values

In [None]:
# nunique() returns the number of unique values in each column
df.nunique()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values


There are no missing values in the data set
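To back this statement with numbers, the missing-value check can be summarised per column. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and `missing_report` are illustrative names, not part of the original notebook); on the real data the same helper would presumably be called as `missing_report(df)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "ram": [512, 1024, np.nan, 2048],
    "battery_power": [800, 1500, 1500, 1900],
})

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

report = missing_report(toy)
print(report)
```

A column with no missing values shows 0 in both fields, which is what we expect for every column of this dataset.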

### What did you know about your dataset?

Answer Here

Our dataset comprises comprehensive information about various mobile phones, including their technical specifications and corresponding price ranges. It encompasses attributes such as battery power, camera quality, RAM, storage, and more. Through data exploration, we gained insights into the range, distribution, and interrelationships of these features. This understanding lays the foundation for building predictive models that can estimate mobile phone price ranges accurately.

There are 2000 rows and 21 columns in the dataset. The dataset does not contain any duplicate values.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

*   **Battery_power** - Total energy a battery can store at one time, measured in mAh
*   **Blue** - Has Bluetooth or not
*   **Clock_speed** - Speed at which the microprocessor executes instructions
*   **Dual_sim** - Has dual SIM support or not
*   **Fc** - Front camera megapixels
*   **Four_g** - Has 4G or not
*   **Int_memory** - Internal memory in gigabytes
*   **M_dep** - Mobile depth in cm
*   **Mobile_wt** - Weight of the mobile phone
*   **N_cores** - Number of processor cores
*   **Pc** - Primary camera megapixels
*   **Px_height** - Pixel resolution height
*   **Px_width** - Pixel resolution width
*   **Ram** - Random access memory in megabytes
*   **Touch_screen** - Has a touch screen or not
*   **Sc_h** - Screen height of the mobile in cm
*   **Sc_w** - Screen width of the mobile in cm
*   **Talk_time** - Longest time that a single battery charge will last during calls
*   **Three_g** - Has 3G or not
*   **Wifi** - Has Wi-Fi or not
*   **Price_range** - Target variable with values 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost)
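Several of the variables above are yes/no flags while others are numeric measurements. A minimal sketch of how the two groups could be separated automatically is shown here, assuming a flag is any feature with exactly two distinct values; `toy` and `split_binary_columns` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two flag columns, one numeric column and the target
toy = pd.DataFrame({
    "blue": [0, 1, 1, 0],
    "dual_sim": [1, 1, 0, 0],
    "ram": [512, 1024, 2048, 4096],
    "price_range": [0, 1, 2, 3],
})

def split_binary_columns(frame, target="price_range"):
    """Split feature columns into binary flags and all other features."""
    features = frame.drop(columns=[target])
    n_unique = features.nunique()
    binary = n_unique[n_unique == 2].index.tolist()
    other = n_unique[n_unique != 2].index.tolist()
    return binary, other

binary_cols, other_cols = split_binary_columns(toy)
print("binary:", binary_cols)
print("other:", other_cols)
```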

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in",i,"is",df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Dataset first look (viewing the first 5 rows)
df.head()
# Dataset last look (viewing the last 5 rows)
df.tail()

### What all manipulations have you done and insights you found?

Answer Here.

Through classification machine learning methods, we performed feature scaling to ensure consistent scale across attributes. We also handled class imbalance using techniques like oversampling and undersampling to improve model performance. Insights revealed that specific features, such as camera quality and RAM, hold significant importance in predicting mobile phone price ranges. Additionally, decision tree analysis highlighted feature thresholds that differentiate different price categories effectively.
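As an illustration of the feature scaling step, here is a hedged sketch in which the scaler is wrapped in a scikit-learn pipeline so that it is fitted on the training split only. The synthetic arrays `X_demo` and `y_demo` are assumptions made for the example and do not come from the project data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: three numeric features and a binary label (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 4000, size=(200, 3))
y_demo = (X_demo[:, 0] > 2000).astype(int)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits MinMaxScaler on the training rows, then the classifier
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting
test_accuracy = model.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the minimum and maximum used for scaling are learned from training rows only, which avoids leaking test-set information into the model.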

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**price**

In [None]:
# Chart - 1 visualization code
sns.set()
price_plot=df['price_range'].value_counts().plot(kind='bar')
plt.xlabel('price_range')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

This vertical bar chart is used to visualize the distribution of the target variable (price ranges) in a classification project. It provides an overview of how instances are distributed among different classes, which is crucial for understanding class imbalance and data distribution.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

The phones fall into 4 price ranges, and the number of phones in each range is almost the same.
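The balance between classes can also be checked numerically rather than by eye. The following is a small sketch on a hypothetical label series (`labels` is made up for the example); with the real data one would use `df['price_range']` instead.

```python
import pandas as pd

# Hypothetical labels: three phones in each of the four price ranges
labels = pd.Series([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], name="price_range")

# Share of each class, ordered by class label
shares = labels.value_counts(normalize=True).sort_index()

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = shares.max() / shares.min()

print(shares)
print("imbalance ratio:", imbalance_ratio)
```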

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, the gained insights help create a positive business impact by aligning products with market demands. However, skewed distributions could signal areas of potential concern or negative growth if not appropriately addressed.

#### Chart - 2

**Battery power**

In [None]:
# Chart - 2 visualization code
sns.set(rc={'figure.figsize':(5,5)})
ax=sns.displot(df["battery_power"])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

Histograms are used to visualize the distribution of a single numerical variable. In a classification project, this chart helps understand the distribution of the "battery_power" feature, which can aid in identifying patterns related to price range categories.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

This plot shows how battery capacity (mAh) is distributed.
There is a gradual increase in battery power as the price range increases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights can potentially lead to a positive business impact by revealing how battery power affects price ranges. Manufacturers can use this information to optimize battery specifications for different price segments, potentially increasing sales and market competitiveness.

#### Chart - 3

**Bluetooth**

In [None]:
# Chart - 3 visualization code
# Bar height is the mean price_range for phones with (1) and without (0) Bluetooth
fig,ax=plt.subplots(figsize=(10,5))
sns.barplot(data=df,x='blue',y='price_range',ax=ax)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

This bar plot is used to visualize the relationship between a categorical feature (blue) and the target variable (price_range). It displays the average value of the target variable for each category of the feature.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Half the devices have Bluetooth, and half don’t.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights can create a positive business impact by helping businesses understand how specific feature categories influence price ranges, enabling them to make informed decisions. However, if any feature category exhibits consistently low price ranges, it could indicate negative growth potential in that category, potentially prompting adjustments in product offerings or marketing strategies.

#### Chart - 4

**Ram**

In [None]:
# Chart - 4 visualization code
df.plot(x='price_range',y='ram',kind='scatter')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

Scatter plots help visualize the relationship between two numeric variables, showing how they might be distributed among different class labels.

In this specific scatter plot, the x-axis represents the "price_range" class labels, and the y-axis represents the "ram" attribute. The plot displays how RAM values are distributed across different price range categories.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

RAM increases continuously with price range, moving from low cost to very high cost.
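A per-class summary is one way to confirm a trend like this without relying on the scatter plot alone. The sketch below uses a hypothetical toy frame (the RAM values are invented for illustration); on the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data: two phones per price range with invented RAM values (MB)
toy = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "ram": [400, 600, 1200, 1400, 2300, 2500, 3400, 3600],
})

# Median RAM within each price range
ram_by_price = toy.groupby("price_range")["ram"].median()

# True when the median never decreases from one price range to the next
is_increasing = ram_by_price.is_monotonic_increasing

print(ram_by_price)
print("median RAM rises with price range:", is_increasing)
```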

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The insights gained from this scatter plot could have a positive business impact by revealing patterns such as higher RAM being associated with higher price categories, enabling manufacturers to make informed decisions on feature offerings.

Negative growth insights could arise if the scatter plot reveals a lack of correlation between RAM and price range, potentially indicating that RAM isn't a significant determinant of pricing, which might impact certain marketing or pricing strategies.

#### Chart - 5

**pixel_width**

In [None]:
# Chart - 5 visualization code
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.kdeplot(data=df, x='px_width', hue='price_range', ax=axs[0])
sns.boxplot(data=df, x='price_range', y='px_width', ax=axs[1])
plt.show()

Pixel width does not increase continuously as we move from low cost to very high cost. Mobiles in the 'Medium cost' and 'High cost' ranges have almost equal pixel width,
so we can say that it would be a driving factor in deciding price_range.


In [None]:
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.kdeplot(data=df, x='px_height', hue='price_range', ax=axs[0])
sns.boxplot(data=df, x='price_range', y='px_height', ax=axs[1])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The left chart is a Kernel Density Estimate (KDE) plot, which displays the distribution of the 'px_width' and 'px_height' attributes for the different 'price_range' categories. The right chart is a box plot, showing the relationship between 'price_range' and each of 'px_width' and 'px_height'. These charts are useful in a classification project to visualize the distribution and relationship between features across different classes. The insights gained can aid in making informed decisions, such as identifying feature value ranges that contribute to specific price ranges.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Pixel height is almost the same as we move from low cost to very high cost; there is little variation in pixel height.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, the gained insights help create a positive impact, which can arise from targeted marketing and tailored product offerings. Negative growth insights might occur if certain feature values overlap across price categories, leading to less distinctive class boundaries, potentially affecting pricing strategies or market segmentation.

#### Chart - 6

**FC (front camera megapixels)**

In [None]:
# Chart - 6 visualization code
df.plot(x='price_range',y='fc',kind='scatter')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

Scatter plots are used in classification projects to visualize the relationship between two numerical variables (in this case, 'price_range' and 'fc'). In this specific scenario, it helps visualize how the feature 'fc' (front camera quality) varies across different price ranges of mobile phones.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

The distribution of this feature is almost the same across all price ranges, so it may not be helpful in making predictions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Gained insights can positively impact business decisions by identifying trends in feature importance, allowing manufacturers to prioritize attributes that align with consumer preferences. Negative growth insights might occur if high-priced phones with lower front camera quality receive unfavorable market response, leading to decreased sales or market share.

#### Chart - 7

**PC (Primary camera Megapixels)**

In [None]:
# Chart - 7 visualization code
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.kdeplot(data=df, x='n_cores', hue='price_range', ax=axs[0])
sns.boxplot(data=df, x='price_range', y='n_cores', ax=axs[1])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.


The chart on the left is a Kernel Density Estimation (KDE) plot, showing the distribution of the 'n_cores' feature colored by 'price_range' categories. The chart on the right is a box plot representing the relationship between 'price_range' and 'n_cores'. These charts are used to visualize the relationship between features and target classes, aiding in feature importance analysis and potential separation between classes.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Primary camera megapixels are showing a little variation along the target categories, which is a good sign for prediction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights from these charts can have a positive business impact by identifying feature patterns that differentiate price ranges, helping manufacturers and retailers tailor their offerings and marketing strategies more effectively.

#### Chart - 8

**Mobile weight**

In [None]:
# Chart - 8 visualization code
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.kdeplot(data=df, x='mobile_wt', hue='price_range', ax=axs[0])
sns.boxplot(data=df, x='price_range', y='mobile_wt', ax=axs[1])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The chart on the left is a Kernel Density Estimation (KDE) plot, displaying the distribution of 'mobile_wt' (mobile weight) for different 'price_range' categories. The chart on the right is a box plot illustrating the distribution of 'mobile_wt' across the 'price_range' categories.

These charts are used to visualize the relationship between features and target classes. The KDE plot shows the distribution of mobile weights across different price ranges, helping to identify potential patterns or overlaps. The box plot offers insights into the central tendency, variability, and potential outliers within each price range.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Costly phones are lighter

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights can positively impact the business by aiding in product positioning and pricing strategies. For instance, understanding how mobile weight correlates with price ranges can guide manufacturers in designing phones that align with target market preferences.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

**Screen_size**

Let's convert screen_size from cm to inches, since in real life we use inches to tell a screen size.
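As a quick sanity check of the conversion, the diagonal is the hypotenuse of the height and width, and 1 inch equals 2.54 cm. The helper below is only a sketch for a single phone; `diagonal_inches` is a hypothetical name and the 12 cm by 5 cm screen is an invented example.

```python
import math

def diagonal_inches(height_cm: float, width_cm: float) -> float:
    """Screen diagonal in inches from height and width in centimetres."""
    diagonal_cm = math.hypot(height_cm, width_cm)  # sqrt(h^2 + w^2)
    return round(diagonal_cm / 2.54, 2)

# A 12 cm x 5 cm screen has a 13 cm diagonal
example = diagonal_inches(12.0, 5.0)
print(example)  # 5.12
```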

In [None]:
# Defining new variable sc_size
df['sc_size'] = np.sqrt((df['sc_h']**2) + (df['sc_w']**2))
df['sc_size'] = round(df['sc_size']/2.54, 2)

In [None]:
# Chart - 10 visualization code
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.kdeplot(data=df, x='sc_size', hue='price_range', ax=axs[0])
sns.boxplot(data=df, x='price_range', y='sc_size', ax=axs[1])
plt.show()

In [None]:
# Let's drop sc_h and sc_w

df.drop(['sc_h', 'sc_w'], axis = 1, inplace = True)

In [None]:
binary_features = [ 'four_g', 'three_g']

##### 1. Why did you pick the specific chart?

Answer Here.

The first chart is a Kernel Density Estimate (KDE) plot, and the second one is a Boxplot. In a classification project, these charts provide insights into the distribution and relationship of a feature (screen size 'sc_size') across different price range categories. The KDE plot shows the density of screen sizes for each price range, while the Boxplot depicts the distribution, median, and outliers in a compact form.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Screen Size shows little variation along the target variables. This can be helpful in predicting the target categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights can have a positive business impact by helping manufacturers identify price range differentiators based on screen size. For instance, if higher-priced phones tend to have larger screen sizes, this knowledge can guide product differentiation and pricing strategies. However, if a specific screen size negatively impacts certain price ranges, adjustments might be needed to cater to diverse customer preferences and maximize sales within those ranges.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Plot of binary features against price range

for col in binary_features:
  fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (15, 6))

  # Order the slices as [1, 0] so they match the legend labels below
  df[col].value_counts().reindex([1, 0]).plot.pie (autopct='%1.1f%%', ax = ax1, shadow=True, labeldistance=None)
  ax1.set_title(f'Share of phones by {col} support')
  ax1.legend(['Support', 'Does not Support'])
  sns.countplot(x = col, hue = 'price_range', data = df, ax = ax2, color = 'pink')
  ax2.set_title('Distribution by price range')
  ax2.set_xlabel(col)
  ax2.legend(['Low Cost', 'Medium Cost', 'High Cost', 'Very High Cost'])
  ax2.set_xticklabels(['Does not Support', 'Support'])
  plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The first chart is a pie chart that visualizes the distribution of a binary feature (e.g., binary_features) in terms of the percentage of "Support" and "Does not Support" categories. This chart provides an overview of the proportion of each category within the binary feature.

The second chart is a grouped bar chart that displays the distribution of the binary feature in relation to different price ranges (e.g., Low Cost, Medium Cost, High Cost, Very High Cost). Each bar in the chart represents the count of instances belonging to each combination of the binary feature and a specific price range.

The pie chart shows the overall distribution of "Support" and "Does not Support" categories, while the grouped bar chart provides a more detailed view of how the binary feature's distribution varies across different price ranges. This insight aids in understanding the potential impact of the binary feature on the classification outcome.
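The counts behind the grouped bar chart can also be tabulated directly. The snippet below is a minimal sketch using `pd.crosstab` on a hypothetical toy frame (the rows are invented); on the real data the same calls would take `df['three_g']` and `df['price_range']`.

```python
import pandas as pd

# Hypothetical toy data: a binary feature and the price range of eight phones
toy = pd.DataFrame({
    "three_g": [0, 0, 1, 1, 1, 1, 0, 1],
    "price_range": [0, 1, 0, 1, 2, 3, 2, 3],
})

# Number of phones for each (feature value, price range) combination
counts = pd.crosstab(toy["three_g"], toy["price_range"])

# Same table as row-wise shares, so each feature value sums to 1
shares = pd.crosstab(toy["three_g"], toy["price_range"], normalize="index")

print(counts)
print(shares.round(2))
```

If the row-wise shares look nearly identical for both feature values, the feature is unlikely to separate the price ranges on its own.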

##### 2. What is/are the insight(s) found from the chart?

Answer Here

The feature 'three_g' plays an important role in prediction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, the gained insights create a positive business impact. By understanding how certain binary features correlate with price ranges, businesses can tailor their marketing strategies, product offerings, and pricing to specific customer preferences. This customization can enhance customer satisfaction and increase sales, contributing positively to the bottom line.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
correlation = df.corr()
plt.figure(figsize = [10, 15])
sns.heatmap(correlation, cmap = 'coolwarm', annot = True)

##### 1. Why did you pick the specific chart?

Answer Here.


The heatmap is used to visualize the correlation between numerical variables in a dataset. In a classification project, the heatmap helps identify relationships between features, highlighting which attributes have stronger correlations with the target variable.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

RAM and price_range show a high correlation, which is a good sign: it signifies
that RAM will be a major deciding factor in estimating the price range.

There is some collinearity in the feature pairs ('pc', 'fc') and ('px_width', 'px_height'). Both correlations are reasonable, since there is a good chance that if the front camera of a phone is good, the back camera is also good.

Also, if px_height increases, px_width also increases, and together they determine the overall number of pixels in the screen. We can replace these two features with one feature.
Front camera megapixels and primary camera megapixels are different entities despite showing collinearity, so we'll keep them as they are.
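Collinear pairs like these can be listed programmatically instead of being read off the heatmap. The helper below is a hedged sketch: `top_correlated_pairs`, the 0.5 threshold and the toy frame are assumptions chosen for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison also discards any NaN entries left by the masking step
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy frame: 'b' is exactly twice 'a', 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

pairs = top_correlated_pairs(toy)
print(pairs)
```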

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, the insights help create a positive business impact. Positive correlations suggest attributes that contribute positively to the target, aiding feature selection and model building. The gained insights can positively impact business decisions by refining feature selection, enhancing model accuracy, and facilitating targeted marketing.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Looking for outliers using box plots, one subplot per numeric column
plt.figure(figsize=(25,10))
for index,item in enumerate(df.describe().columns.to_list()):
  plt.subplot(5,5,index+1)
  sns.boxplot(x=df[item])
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

This is a grid of box plots, with each subplot representing the distribution of a specific numerical attribute in the dataset. This type of chart, often referred to as a "boxplot grid" or "boxplot matrix," is used to visualize the distribution, central tendency, and spread of data within each attribute. In a classification project, this chart helps in understanding the distribution of numerical features, which aids in identifying potential outliers, class separability, and feature importance for classification.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

There are not many outliers.
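The box plots flag points beyond 1.5 times the interquartile range (IQR) from the quartiles, so the same rule can be used to count outliers per column. The snippet below is a minimal sketch; `iqr_outlier_count` and the sample values are hypothetical and only illustrate the rule.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical sample with one obviously extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 100])
n_out = iqr_outlier_count(values)
print(n_out)  # 1
```

On the real data, something like `df.apply(iqr_outlier_count)` would give one count per numeric column.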

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights from this chart can lead to a positive business impact by highlighting attribute variations across different price ranges of mobile phones. Understanding feature distributions among price categories can guide manufacturers and retailers in decision-making, allowing them to tailor product features and marketing strategies to specific target price segments.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#  defining new variable for pixels

df['pixels'] = df['px_height']*df['px_width']
# Dropping px_height and px_width

df.drop(['px_height', 'px_width'], axis = 1, inplace = True)
# Checking for multi-collinearity

correlation = df.corr()
plt.figure(figsize = [20, 15])
sns.heatmap(correlation, cmap = 'coolwarm', annot = True)

In [None]:
# Defining X and y

X = df.drop(['price_range'], axis = 1)
y = df['price_range']

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# Scaling values of X

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Splitting dataset into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.20, random_state = 42)

In [None]:
X_train.shape

In [None]:
y_train.shape

##### 1. Why did you pick the specific chart?

Answer Here.

Heatmap is used to visualize the correlation between different attributes of the dataset. It helps identify potential multicollinearity (high correlation between features) and assess the relationships between variables, aiding in feature selection and model interpretation.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

RAM and price_range show a high correlation, which is a good sign: it signifies
that RAM will be a major deciding factor in estimating the price range.

There is some collinearity in the feature pairs ('pc', 'fc') and ('px_width', 'px_height'). Both correlations are reasonable, since there is a good chance that if the front camera of a phone is good, the back camera is also good.

Also, if px_height increases, px_width also increases, and together they determine the overall number of pixels in the screen. We can replace these two features with one feature.
Front camera megapixels and primary camera megapixels are different entities despite showing collinearity, so we'll keep them as they are.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The insights gained from the heatmap can have a positive business impact by improving model performance through informed feature selection, reducing multicollinearity-related issues, and enhancing model interpretability.

#### Chart - 15 - Pair Plot

# Logistic Regression

In [None]:
# Applying logistic regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
# Prediction

y_pred_test = lr.predict(X_test)
y_pred_train = lr.predict(X_train)
# Evaluation metrics for test

In [None]:
from sklearn.metrics import classification_report
print('Classification report for Logistic Regression (Test set)= ')
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred_test))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

#Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - List must be in alphabetical order
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
# Evaluation metrics for train

from sklearn.metrics import classification_report
print('Classification report for Logistic Regression (Train set)= ')
# classification_report expects the true labels first, then the predictions
print(classification_report(y_train, y_pred_train))

# Random Forest

In [None]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
# taking 300 trees
clsr = RandomForestClassifier(n_estimators=300)
clsr.fit(X_train, y_train)

In [None]:
y_pred = clsr.predict(X_test)
test_score= accuracy_score(y_test, y_pred)
test_score

In [None]:
y_pred_train = clsr.predict(X_train)
train_score = accuracy_score(y_train, y_pred_train)
train_score

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

#Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - List must be in alphabetical order
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
feature_importance = pd.DataFrame({'Feature':X.columns,
                                   'Score':clsr.feature_importances_}).sort_values(by='Score', ascending=False).reset_index(drop=True)
feature_importance.head()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax = sns.barplot(x=feature_importance['Score'], y=feature_importance['Feature'])
plt.show()

**Hyperparameter tuning for Random Forest**

In [None]:
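from sklearn.model_selection import GridSearchCV

# Search grid; 'auto' is omitted because recent scikit-learn releases no longer accept it
# (for classifiers it was equivalent to 'sqrt')
params = {'n_estimators': [10, 50, 100, 200],
          'max_depth': [10, 20, 30, 40],
          'min_samples_split': [2, 4, 6],
          'max_features': ['sqrt', 4, 'log2'],
          'max_leaf_nodes': [10, 20, 40]
          }
rf = RandomForestClassifier()
clsr = GridSearchCV(rf, params, scoring='accuracy', cv=3)
# Tune on the training split only, so the test split stays unseen during model selection
clsr.fit(X_train, y_train)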

In [None]:
clsr.best_params_

In [None]:
clsr.best_estimator_

In [None]:
clsr.best_score_

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Refit with the hyperparameters selected by the grid search; all other settings keep their defaults
clsr = RandomForestClassifier(n_estimators=200, max_depth=30, max_features='log2',
                              max_leaf_nodes=40, min_samples_split=4)
clsr.fit(X_train, y_train)

In [None]:
y_pred = clsr.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

#Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - list must follow the ascending class order (0 to 3)
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
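# Training-set accuracy of the tuned model, for comparison with the test accuracy above
y_pred_train = clsr.predict(X_train)
accuracy_score(y_train, y_pred_train)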

In [None]:
feature_importance = pd.DataFrame({'Feature':X.columns,
                                   'Score':clsr.feature_importances_}).sort_values(by='Score', ascending=False).reset_index(drop=True)
feature_importance.head()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax = sns.barplot(x=feature_importance['Score'], y=feature_importance['Feature'])
plt.show()

# Decision tree

In [None]:
# Applying Decision Tree

from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(max_depth = 5)
dtc.fit(X_train, y_train)

In [None]:
# Prediction

y_pred_test = dtc.predict(X_test)
y_pred_train = dtc.predict(X_train)

In [None]:
accuracy_score(y_test, y_pred_test)

In [None]:
# Evaluation metrics for test

print('Classification report for Decision Tree (Test set)= ')
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred_test))

In [None]:
# Hyperparameter tuning with 5-fold cross-validated grid search

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(dtc, param_grid={'max_depth': [5, 30], 'max_leaf_nodes': [10, 100]}, scoring='accuracy', cv=5, verbose=24)
grid.fit(X_train, y_train)

In [None]:
# Prediction

y_pred_test = grid.predict(X_test)
y_pred_train = grid.predict(X_train)
# Evaluation metrics for test

print('Classification Report for Decision Tree (Test set)= ')
print(classification_report(y_test, y_pred_test))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

#Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - list must follow the ascending class order (0 to 3)
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
# Evaluation metrics for train

print('Classification Report for Decision Tree (Train set)= ')
print(classification_report(y_train, y_pred_train))

# xgboost

In [None]:
# Applying XGBoost

from xgboost import XGBClassifier

xgb = XGBClassifier(max_depth = 5, learning_rate = 0.1)
xgb.fit(X_train, y_train)
# Prediction

y_pred_train = xgb.predict(X_train)
y_pred_test = xgb.predict(X_test)
# Evaluation metrics for test

score = classification_report(y_test, y_pred_test)
print('Classification Report for XGBoost(Test set)= ')
print(score)

In [None]:
# Evaluation metrics for train

score = classification_report(y_train, y_pred_train)
print('Classification Report for XGBoost(Train set)= ')
print(score)

In [None]:
# Cross validation

grid = GridSearchCV(xgb, param_grid={'n_estimators': (10, 200), 'learning_rate': [1, 0.5, 0.1, 0.01, 0.001], 'max_depth': (5, 10),
                                     'gamma': [1.5, 1.8], 'subsample': [0.3, 0.5, 0.8]}, cv = 5, scoring = 'accuracy', verbose = 10)
grid.fit(X_train,y_train)

In [None]:
# Prediction

y_pred_train = grid.predict(X_train)
y_pred_test = grid.predict(X_test)
# Evaluation metrics for test

score = classification_report(y_test, y_pred_test)
print('Classification Report for tuned XGBoost(Test set)= ')
print(score)

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

#Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - list must follow the ascending class order (0 to 3)
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
# Evaluation metrics for train

score = classification_report(y_train, y_pred_train)
print('Classification Report for tuned XGBoost(Train set)= ')
print(score)

##### 1. Why did you pick the specific chart?

A confusion-matrix heatmap is commonly used to visualize the performance of a classification model by showing how actual and predicted classes align. Each cell holds the count (or percentage) of instances with a given actual class (row) that received a given predicted class (column).

The heatmap shows at a glance where the model predicts correctly and where it makes errors, and which classes it confuses with one another. That makes it a useful tool for judging classification performance and for deciding where the model needs improvement.
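The "count or percentage" distinction can be made concrete with a short sketch. It assumes rows are actual classes and columns are predicted classes, as in the heatmaps above; the helper name `row_normalize` and the toy counts are illustrative only and are not results from this project.

```python
import numpy as np

def row_normalize(cm):
    """Convert a confusion matrix of counts into row-wise percentages."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1, keepdims=True)
    # Guard against classes with no actual samples (avoids division by zero).
    safe_totals = np.where(row_totals == 0, 1.0, row_totals)
    return 100.0 * cm / safe_totals

# Toy 4-class matrix (made-up counts, NOT results from this project).
toy_cm = np.array([
    [45, 5, 0, 0],
    [4, 40, 6, 0],
    [0, 5, 42, 3],
    [0, 0, 2, 48],
])

cm_percent = row_normalize(toy_cm)
print(np.round(cm_percent, 1))
```

Passing `cm_percent` to `sns.heatmap` instead of the raw counts would show, for each actual class, what share of its devices landed in each predicted class.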

##### 2. What is/are the insight(s) found from the chart?

In the confusion matrices above, the diagonal cells (top-left to bottom-right) hold the highest values, indicating that the model predicts most instances in their correct class. A few off-diagonal cells have elevated values, which mark misclassified instances. These cells point to the specific price ranges where the model needs further attention and improvement to predict mobile phone price ranges more accurately.
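One way to read the diagonal and off-diagonal cells numerically is sketched below. The helper names `per_class_recall` and `top_confusions` are hypothetical, and the toy matrix uses made-up counts rather than this project's output; rows are assumed to be actual classes and columns predicted classes.

```python
import numpy as np

def per_class_recall(cm):
    """Share of each actual class that was predicted correctly (diagonal / row total)."""
    cm = np.asarray(cm, dtype=float)
    totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), totals, out=np.zeros_like(totals), where=totals > 0)

def top_confusions(cm, k=3):
    """Return up to k (actual, predicted, count) triples with the largest off-diagonal counts."""
    cm = np.asarray(cm)
    pairs = [(i, j, int(cm[i, j]))
             for i in range(cm.shape[0])
             for j in range(cm.shape[1])
             if i != j and cm[i, j] > 0]
    return sorted(pairs, key=lambda t: t[2], reverse=True)[:k]

# Toy 4-class matrix (made-up counts, NOT results from this project).
toy_cm = np.array([
    [45, 5, 0, 0],
    [4, 40, 6, 0],
    [0, 5, 42, 3],
    [0, 0, 2, 48],
])

recall = per_class_recall(toy_cm)
worst = top_confusions(toy_cm, k=2)
print(recall)
print(worst)
```

In this toy matrix all errors fall between neighbouring price ranges, which is the pattern one would look for when deciding which class boundaries need attention.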

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights have a positive business impact. By analyzing the confusion matrix, businesses can identify which classes the model is consistently misclassifying and take targeted actions to improve accuracy. For instance, adjusting marketing strategies for specific phone categories or refining product features can lead to better alignment with consumer preferences.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective of accurately predicting mobile phone price ranges:

Feature Importance Analysis: Identify key attributes driving accurate predictions and refine data collection strategies for those features.

Model Refinement: Continuously update and fine-tune classification algorithms to enhance prediction accuracy.

Feedback Loop: Incorporate user feedback on misclassified instances to improve model performance over time.

Customer Insights: Leverage predictions to understand customer preferences and tailor marketing strategies accordingly.

Competitor Analysis: Analyze competitors' pricing strategies in relation to predictions, enabling strategic market positioning.

# **Conclusion**

1.  The EDA shows mobile phones in four price ranges, with a nearly equal number of devices in each.
2.  About half the devices have Bluetooth and half do not.
3.  Battery power increases gradually as the price range increases.
4.  RAM increases steadily with price range, from low cost to very high cost.
5.  Costlier phones are lighter.
6.  RAM, battery power, and pixel resolution played the most significant roles in determining the price range of a mobile phone.
7.  From all the above experiments, we conclude that logistic regression and XGBoost with hyperparameter tuning gave the best results.
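
A comparison like the one in point 7 can be tabulated with a small helper. This is a minimal sketch in plain Python: the function names `accuracy` and `compare_models`, the model names, and the toy labels are all hypothetical and do not reproduce this project's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def compare_models(results):
    """Rank models by test accuracy.

    results maps a model name to (y_train, y_pred_train, y_test, y_pred_test).
    Returns rows of (name, train_acc, test_acc, gap), best test accuracy first.
    A large positive gap (train minus test) hints at overfitting.
    """
    rows = []
    for name, (y_tr, p_tr, y_te, p_te) in results.items():
        train_acc = accuracy(y_tr, p_tr)
        test_acc = accuracy(y_te, p_te)
        rows.append((name, train_acc, test_acc, train_acc - test_acc))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Toy labels for two placeholder models (NOT results from this project).
toy_results = {
    "model_a": ([0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 2]),
    "model_b": ([0, 1, 2, 3], [0, 1, 2, 2], [0, 1, 2, 3], [0, 1, 2, 3]),
}

ranking = compare_models(toy_results)
for name, train_acc, test_acc, gap in ranking:
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f} gap={gap:+.2f}")
```

In the notebook, the tuples would be filled with the `y_train`, `y_pred_train`, `y_test`, and `y_pred_test` arrays produced for each fitted model.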

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***