# **Feature Extraction and Price Prediction for Mobile Phones**


**Problem Statement**
- I worked for a prominent organization that specializes in selling mobile phones. The organization wants to improve its pricing strategy by understanding which features most influence mobile phone prices in today's highly competitive market. My objective is to build a predictive model that accurately estimates the price of a mobile phone from its features. To achieve this, I perform a feature extraction analysis to identify the most influential features.



**Project Description:**
- In this project, I worked with a dataset that contains detailed information about various mobile phones, including their model, color, memory, RAM, battery capacity, rear camera specifications, front camera specifications, presence of AI lens, mobile height, processor, and, most importantly, the price.
- My goal is to develop a predictive model for mobile phone prices.


**Data Wrangling**
- Data wrangling converts and formats raw data into a usable form for the rest of the data science pipeline.

# **Data Exploration:**

**Import Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import seaborn as sns
import warnings
# For Jupyter Notebook
%matplotlib inline
warnings.filterwarnings('ignore')


**Loading the given Dataset**

In [2]:
data=pd.read_csv('/content/Processed_Flipdata - Processed_Flipdata (1).csv')

FileNotFoundError: [Errno 2] No such file or directory: '/content/Processed_Flipdata - Processed_Flipdata (1).csv'

In [None]:
data # to check data

**Check the shape of data**

In [None]:
data.shape[0]#only no of rows

In [None]:
data['Prize'].unique()


In [None]:
data.shape[1]#only no of columns

**Check first five Rows**

In [None]:
data.head()

**Check all the column names of dataset**

In [None]:
data.columns

In [None]:
# Number of unique elements in each column
unique = data.nunique()
unique.to_frame().T

In [None]:
data.rename(columns={'Prize': 'Price'}, inplace=True)

**Convert the Price column to numeric:**

In [None]:
# Remove non-numeric characters if necessary
data['Price'] = data['Price'].replace(r'[\$,]', '', regex=True).astype(float)

# Verify the conversion
data['Price'].head()


In [None]:
data.dtypes

In [None]:
data.head()

**check info of dataset**

In [None]:
# Getting the information of the data
data.info()

**Descriptive Statistics**

**Summary Statistics**
- **Measures of central tendency**
- **Mean**: the average of all values.
- **Median**: the middle value when the data is sorted.
- **Mode**: the most frequently occurring value in the dataset.
- **describe()** shows basic statistical details such as percentiles, mean, and standard deviation for a DataFrame or a Series of numeric values.
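As a small illustration of these three measures, the sketch below computes them with pandas on a made-up list of prices. The values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical prices, used only to illustrate the three measures
prices = pd.Series([10, 12, 12, 15, 21])

mean_price = prices.mean()           # average of all values
median_price = prices.median()       # middle value of the sorted data
mode_price = prices.mode().iloc[0]   # most frequent value (first one if there are ties)

print(mean_price, median_price, mode_price)
```

The mean is pulled upward by the single large value, while the median and mode stay at the typical value, which is why the median is often preferred for skewed price data.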

In [None]:
data.describe(include='all')

In [None]:
data.dtypes

**It filters columns that have object data types, which typically represent strings or categorical variables.**

In [None]:
cat_col=data.select_dtypes(include='object')

In [None]:
cat_col

**It filters columns that have int64,float data types, which typically represent numerical features**

In [None]:
num_col=data.select_dtypes(include=['int64', 'float'])

In [None]:
num_col

- **Measures of dispersion**: Measures of dispersion describe how widely the data is scattered around its center.
 - **Range**: the difference between the largest and the smallest value in the distribution.
 - **Standard deviation**: the square root of the arithmetic average of the squared deviations from the mean.
 - **Percentiles**: the value below which a given percentage of the observations fall.
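A minimal sketch of these dispersion measures on hypothetical battery capacities (the numbers are invented for illustration). Note that pandas divides by n - 1 by default, so `ddof=0` is passed to match the definition given above.

```python
import pandas as pd

# Hypothetical battery capacities in mAh
battery = pd.Series([4000, 4500, 5000, 5000, 6000])

value_range = battery.max() - battery.min()   # largest minus smallest value
sample_std = battery.std()                    # pandas default: divides by n - 1
population_std = battery.std(ddof=0)          # divides by n, as in the definition above
q25, q75 = battery.quantile([0.25, 0.75])     # 25th and 75th percentiles

print(value_range, round(sample_std, 2), round(population_std, 2), q25, q75)
```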


In [None]:
data['Price'].unique()

In [None]:
data['Price'].nunique()

In [None]:
data['Price'].value_counts()

In [None]:
numerical_summary = num_col.describe().transpose()
palette = sns.color_palette("viridis", as_cmap=True)
numerical_summary.style.background_gradient(cmap=palette)

**The dataset comprises 541 observations, detailing attributes such as memory, RAM, battery capacity, AI lens presence, and mobile height. Key highlights include a median memory of 128, a median RAM of 6, and a median battery capacity of 5000 mAh.**

### **The range of values for each feature.**

In [None]:
# Check range for 'Memory'
if data['Memory'].dtype in ['int64', 'float64']:
    print(f"Memory: min = {data['Memory'].min()}, max = {data['Memory'].max()}")
else:
    print(f"Memory: unique values = {data['Memory'].unique()}")



In [None]:
# Check range for 'Price'
if data['Price'].dtype in ['int64', 'float64']:
    print(f"Price: min = {data['Price'].min()}, max = {data['Price'].max()}")
else:
    print(f"Price: unique values = {data['Price'].unique()}")

In [None]:
 #Check range for 'RAM'
if data['RAM'].dtype in ['int64', 'float64']:
    print(f"RAM: min = {data['RAM'].min()}, max = {data['RAM'].max()}")
else:
    print(f"RAM: unique values = {data['RAM'].unique()}")



In [None]:
# Check range for 'Battery_'
if data['Battery_'].dtype in ['int64', 'float64']:
    print(f"Battery_: min = {data['Battery_'].min()}, max = {data['Battery_'].max()}")
else:
    print(f"Battery_: unique values = {data['Battery_'].unique()}")



In [None]:
# Check range for 'AI Lens'
if data['AI Lens'].dtype in ['int64', 'float64']:
    print(f"AI Lens: min = {data['AI Lens'].min()}, max = {data['AI Lens'].max()}")
else:
    print(f"AI Lens: unique values = {data['AI Lens'].unique()}")



In [None]:
# Check range for 'Mobile Height'
if data['Mobile Height'].dtype in ['int64', 'float64']:
    print(f"Mobile Height: min = {data['Mobile Height'].min()}, max = {data['Mobile Height'].max()}")
else:
    print(f"Mobile Height: unique values = {data['Mobile Height'].unique()}")



In [None]:
# Check range for 'Model'
print(f"Model: unique values = {data['Model'].unique()}")



In [None]:
# Check range for 'Colour'
print(f"Colour: unique values = {data['Colour'].unique()}")



In [None]:
# Check range for 'Rear Camera'
print(f"Rear Camera: unique values = {data['Rear Camera'].unique()}")



In [None]:
# Check range for 'Front Camera'
print(f"Front Camera: unique values = {data['Front Camera'].unique()}")



In [None]:
# Check range for 'Processor_'
print(f"Processor_: unique values = {data['Processor_'].unique()}")



# **Data Visualization**

## **Line Chart:**
Line graphs are used to track changes over short and long periods of time. When the changes are small, line graphs are a better choice than bar graphs. Line graphs can also compare changes over the same period for more than one group.

In [None]:
#Plot the Data
plt.figure(figsize=(16,6))
sns.lineplot(data=data)

In [None]:
# Print the names of all columns
list(data.columns)
# Plot battery capacity and price against the row index
plt.figure(figsize=(14,6))
plt.title("Battery capacity and price by product")
# Line showing the battery capacity of each product
sns.lineplot(data=data['Battery_'], label="Battery Size")
# Line showing the price of each product
sns.lineplot(data=data['Price'], label="Price")

## **Bar Chart**
- Bar graphs are used to compare things between different groups or to track changes over time. However, when trying to measure change over time, bar graphs are best when the changes are larger.

In [None]:
#Set the width and height of the figure
plt.figure(figsize=(10,6))
#Add title
plt.title("Average battery power by phone model")
#Bar chart showing average battery power by phone model
# Group the data by 'Model' and calculate the mean battery power for each group.
# Then reset the index to convert the result into a DataFrame suitable for plotting.
model_battery = data.groupby('Model')['Battery_'].mean().reset_index()
sns.barplot(x='Model', y='Battery_', data=model_battery)
#Add label for vertical axis
plt.ylabel("Battery size")

In [None]:
import plotly.express as px
fig = px.violin(data, x="Price", y="RAM", color="Price", box=True,points = "all")
fig.show()

In [None]:
fig = px.scatter_3d(data, x='RAM', y='Memory', z='Price',
              color='Price')
fig.show()

# **Histogram**

In [None]:
# Visualize data using histograms
for column in num_col:
    plt.figure(figsize=(8, 6))
    plt.hist(data[column], bins=20, edgecolor='k', alpha=0.7)
    plt.title(f'Histogram for {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

## **Box Plot**

In [None]:
#  Create the plot grid
rows = 3
columns = 2

fig, axes = plt.subplots(rows,columns, figsize=(30,30))

x, y = 0, 0

for i, column in enumerate(num_col):
    sns.boxplot(x=data[column], ax=axes[x, y])

    if y < columns-1:
        y += 1
    else: # Simplified condition - reset y when a row is filled
        x += 1
        y = 0

    # Check if we've filled all subplots
    if x >= rows:
        break

plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()

In [None]:
# Check for noise with a pairplot
sns.set_palette('crest')
sns.set_style('darkgrid')
# Pass the numeric column names as a list
dnp = sns.pairplot(data, vars=num_col.columns.tolist())

# Add axis labels and tick labels to the plot
dnp.set(xticklabels=[], yticklabels=[])
dnp.axes[0][0].set_ylabel(num_col.columns.tolist()[0], fontsize=14)
dnp.axes[-1][0].set_xlabel(num_col.columns.tolist()[0], fontsize=14)
dnp.axes[-1][0].xaxis.labelpad = 20
dnp.axes[-1][-1].yaxis.labelpad = 20

# Title of the plot
dnp.fig.suptitle('Pairplot for each variable\n(Range: min={}, max={})'.format(data[num_col.columns].min().min(), data[num_col.columns].max().max()), y=1.03, fontsize=25)

# Show the plot
plt.show()


# **Data Cleaning:**

**Missing Values**

In [None]:
data.isnull().sum()

In [None]:
# Missing values
plt.figure(figsize=(22,4))
sns.heatmap((data.isna().sum()).to_frame(name='').T,cmap='GnBu', annot=True,
             fmt='0.0f').set_title('Count missing values', fontsize=18)
plt.show()

In [None]:
# Check for duplicates
data.duplicated().sum()

In [None]:
# Remove duplicates in-place
data.drop_duplicates(inplace=True)

In [None]:
data.columns

In [None]:
# Check for remaining missing values
missing_values = data.isnull().sum()
print(missing_values)

In [None]:
data.duplicated().sum()

In [None]:
# Drop the "Unnamed: 0" column
data = data.drop(columns=['Unnamed: 0'])

In [None]:
num_cols = data.select_dtypes(include=['int64', 'float64'])

## **Detect Outliers**

In [None]:
# Calculate the IQR for each column in the dataset
Q1 = num_cols.quantile(0.25)
Q3 = num_cols.quantile(0.75)
IQR = Q3 - Q1

# Identify outliers using the IQR method
outliers = ((num_cols < (Q1 - 1.5 * IQR)) | (num_cols > (Q3 + 1.5 * IQR)))

# Count the number of outliers for each variable
num_outliers = outliers.sum()

# Number of outliers for each variable
num_outliers.to_frame().T
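To make the IQR rule concrete, here is a small worked sketch on a hypothetical series with one obvious extreme value. The numbers are invented; only the 1.5 × IQR fences mirror the cell above.

```python
import pandas as pd

# Hypothetical values with one obvious extreme point
values = pd.Series([1, 2, 3, 4, 5, 100])

q1 = values.quantile(0.25)   # 2.25
q3 = values.quantile(0.75)   # 4.75
iqr = q3 - q1                # 2.5

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(lower, upper, values[is_outlier].tolist())
```

Only the value 100 falls outside the fences of -1.5 and 8.5, so it is the single flagged point.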

## **Remove Outliers**

In [None]:
def clean_data(df):
    removed_outliers_info = {}  # Dictionary to store information about removed outliers

    def remove_outliers(column):
        Q1 = column.quantile(0.25)
        Q3 = column.quantile(0.75)
        IQR = Q3 - Q1
        return column[(column >= Q1 - 1.5 * IQR) & (column <= Q3 + 1.5 * IQR)]

    # Iterate over each numerical column
    for column in df.select_dtypes(include=['float64', 'int64']).columns:
        # Count rows before outlier removal
        rows_before = df.shape[0]

        # Remove outliers from the column
        cleaned_column = remove_outliers(df[column])

        # Count rows after outlier removal
        rows_after = cleaned_column.shape[0]

        # Calculate number of outliers removed
        outliers_removed = rows_before - rows_after

        # Store information about removed outliers
        removed_outliers_info[column] = outliers_removed

        # Replace outliers with NaN, then fill them with the median of the remaining values
        df[column] = cleaned_column
        df[column] = df[column].fillna(cleaned_column.median())

    # Drop rows with any remaining NaN values
    df.dropna(inplace=True)

    # Create a DataFrame of cleaned data
    cleaned_data = df.copy()

    # Return cleaned DataFrame and removed outliers information
    return cleaned_data, pd.DataFrame.from_dict(removed_outliers_info, orient='index', columns=['Outliers Removed']).T

# Call the function to clean data
cleaned_data, df_removed_outliers = clean_data(data)

print("\nInformation about removed outliers:")
df_removed_outliers

In [None]:
cleaned_data

In [None]:
data.isnull().sum()

In [None]:
data['Price'].isnull().sum()

# **Exploratory Data Analysis**

**What are the most popular models?**

In [None]:

brand_counts = data['Model'].value_counts()

# Get the top 5 most popular models
top_models = brand_counts.head(5)

# Create a bar chart using Plotly Express
fig = px.bar(x=top_models.index, y=top_models.values,
             labels={'x': 'Model', 'y': 'Count'},
             title='Top 5 Most Popular Models',
             template='plotly_dark')

# Update layout for better readability
fig.update_layout(xaxis_tickangle=-45, xaxis_tickfont_size=12)

fig.show()

**Which models have high storage?**



In [None]:

# Find models with high storage (assuming 'Memory' represents storage size)
high_storage_models = data.sort_values(by='Memory', ascending=False).head(5)

print("Models with High Storage:")
print(high_storage_models[['Model', 'Memory']])

# Visualize models with high storage
fig = px.bar(high_storage_models, x='Model', y='Memory',
             labels={'Model': 'Model', 'Memory': 'Memory (GB)'},
             title='Models with High Storage',
             template='plotly_dark')

fig.show()

**Visualize the relationship between 'RAM' and 'Memory' of mobile phones, and use 'Price' to determine the color of the points.**

In [None]:
# Create a scatter plot with color mapping based on 'Price'
fig = px.scatter(data_frame=data, x='RAM', y='Memory', color='Price',
                 labels={'RAM': 'RAM (GB)', 'Memory': 'Memory (GB)', 'Price': 'Price ($)'},
                 title='Scatter Plot: RAM vs Memory (Color by Price)',
                 template='plotly_dark')

fig.show()

**Which models offer the best value for money based on storage capacity and price?**

In [None]:
# Calculate the price per GB
data['Price_per_GB'] = data['Price'] / data['Memory']

# Sort the models by price per GB in ascending order (lower price per GB means better value for money)
best_value_models = data.sort_values(by='Price_per_GB')

# Display the top models offering the best value for money
best_value_models[['Model', 'Memory', 'Price', 'Price_per_GB']].T

In [None]:
# Visualize using Plotly Express
fig = px.bar(best_value_models,
             x='Model',
             y='Price_per_GB',
             hover_data=['Memory', 'Price'],
             title='Best Value for Money Models (Price per GB)',
             labels={'Price_per_GB': 'Price per GB'},
             height=600)

# Update layout for better visualization
fig.update_layout(xaxis_title='Model', yaxis_title='Price per GB', xaxis_tickangle=-45)

# Show the plot
fig.show()


In [None]:
data['Price'].isnull().sum()

## **Bar Chart: Average Battery Capacity by Processor**
**What is the average battery capacity for each processor type?**

In [None]:
# Group by processor type and calculate the average battery capacity
avg_battery_by_processor = data.groupby('Processor_')['Battery_'].mean().reset_index()

# Rename columns for clarity
avg_battery_by_processor.columns = ['Processor', 'Average_Battery_Capacity']

In [None]:
avg_battery_by_processor

## **Heatmap: Correlation Between Features**
**What are the correlations between different features in the dataset?**

In [None]:
num_cols.corr()

**The correlation matrix shows:**

- Most correlations between these features are weak, except for the strong positive relationships between Memory and RAM (0.625), and Battery and Mobile Height (0.696). This indicates that while some features like memory and RAM, and battery capacity and mobile height tend to increase together, most other features do not show strong linear relationships.

In [None]:
# Calculate the correlation matrix
correlation_matrix = num_cols.corr()
# Create a heatmap of the numerical columns
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

**Histogram: Distribution of Mobile Heights**
**What is the distribution of mobile heights?**

In [None]:
# Create histogram
fig = px.histogram(data, x='Mobile Height',
                   title='Distribution of Mobile Heights',
                   labels={'Mobile Height': 'Mobile Height (mm)'},
                   template='plotly_dark')

# Show the plot
fig.show()

# **Data Preprocessing**

In [None]:
cleaned_data.head()

In [None]:
cleaned_data.shape

In [None]:
cleaned_data.isnull().sum()

### **Encode Categorical Features**

## **Label Encoding:**

**Convert categorical variables into numerical labels. This is useful for algorithms that can handle numerical inputs but not categorical inputs.**
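As a minimal sketch of what label encoding does, the example below encodes a hypothetical list of colours. The values are invented; `LabelEncoder` assigns integer codes in the sorted order of the distinct labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical colour values, not taken from the dataset
colours = ["Black", "Blue", "Black", "Green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colours)        # Black -> 0, Blue -> 1, Green -> 2

# The fitted encoder remembers the mapping, so codes can be turned back into labels
mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
decoded = encoder.inverse_transform(codes)

print(codes.tolist(), mapping, list(decoded))
```

The integer codes imply an ordering (Black < Blue < Green) that has no real meaning, which is one reason one-hot encoding is often preferred for linear models.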

In [None]:
cat=cleaned_data.select_dtypes(include=['object'])

In [None]:
cat

In [None]:
from sklearn.preprocessing import LabelEncoder

# Dictionary to store the encoded values
encoded_columns = {}

# Encode each categorical column with its own LabelEncoder
for column in cat.columns:
    label_encoder = LabelEncoder()
    # Fit and transform the column
    encoded_columns[column] = label_encoder.fit_transform(cleaned_data[column])

# Create a DataFrame of encoded values that shares the index of cleaned_data
encoded_data = pd.DataFrame(encoded_columns, index=cleaned_data.index)

# Concatenate the encoded DataFrame with the numerical columns of cleaned_data
combined_data = pd.concat([cleaned_data.drop(columns=cat.columns), encoded_data], axis=1)

# Display the combined DataFrame
print("\nCombined DataFrame:")
combined_data

In [None]:
encoded_data

In [None]:
label_encoder

In [None]:
cleaned_data[column]



---



In [None]:
encoded_data

# **One-Hot Encoding**

**Convert categorical variables (e.g., model, colour) into a suitable numerical format, such as one-hot encoding.**
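A minimal sketch of one-hot encoding on a hypothetical three-row table (the values are invented). Each distinct category becomes its own 0/1 column, so no artificial ordering is introduced.

```python
import pandas as pd

# Hypothetical rows, used only to show the shape of the result
phones = pd.DataFrame({"Colour": ["Black", "Blue", "Black"], "RAM": [4, 6, 8]})

# One new indicator column per distinct colour; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(phones, columns=["Colour"], dtype=int)

print(encoded)
```

A column with many distinct values, such as a model name, produces one indicator column per value, so the width of the table can grow quickly.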


In [None]:
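# Apply one-hot encoding to the categorical columns
# (cf is assumed to be the list of categorical feature names)
cf = data.select_dtypes(include='object').columns.tolist()
data_encoded = pd.get_dummies(data, columns=cf)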


In [None]:
data_encoded

In [None]:
# Append the label-encoded columns to cleaned_data; the suffix avoids duplicate column names
data_new = pd.concat([cleaned_data, encoded_data.set_axis(cleaned_data.index).add_suffix('_encoded')], axis=1)

In [None]:
data_new

In [None]:
data_new.columns

In [None]:
# 'Price' is the target variable
X = cleaned_data.drop('Price', axis=1)
y = cleaned_data['Price']


In [None]:
#Splitting data into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30,random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

**Feature Scaling**
- **Normalize/Standardize Numerical Features**
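The sketch below illustrates standardization on hypothetical RAM values. The key point is that the scaler learns the mean and standard deviation from the training data only and then reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical RAM values split into a training and a test part
train = np.array([[4.0], [6.0], [8.0]])
test = np.array([[12.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from the training data
test_scaled = scaler.transform(test)         # reuses the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

After scaling, the training column has mean 0 and unit variance, while the test value is expressed in training standard deviations.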

In [None]:
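from sklearn.preprocessing import StandardScaler
# Scale only the numeric columns; the categorical (object) columns cannot be standardized directly
numeric_features = X_train.select_dtypes(include='number').columns
sc = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numeric_features] = sc.fit_transform(X_train[numeric_features])
X_test_scaled[numeric_features] = sc.transform(X_test[numeric_features])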

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# **Feature Extraction**

# **1. Missing Value Ratio**

In [None]:
# Calculate the percentage of missing values for each feature
missing_ratio = data.isnull().mean()

# Display the missing value ratio
print("Missing Value Ratio:\n", missing_ratio)

**Remove Features with High Missing Value Ratio.**

In [None]:
# Set a threshold for missing values (e.g., 30%)
threshold = 0.30

# Select features that have a missing value ratio below the threshold
features_to_keep = missing_ratio[missing_ratio < threshold].index

# Create a new dataframe with selected features
data_reduced = data[features_to_keep]

# Display the reduced dataframe
data_reduced

In [None]:
num_cols

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features (mean=0 and variance=1)
scaler = StandardScaler()
# num_cols is the DataFrame of numeric columns, so its column names select those columns in data
data[num_cols.columns] = scaler.fit_transform(data[num_cols.columns])


In [None]:
data[num_cols.columns]

# **Univariate Feature Selection**

In [None]:
data['Price']

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.impute import SimpleImputer


In [None]:
data.shape

In [None]:
cat_col.columns

In [None]:
data.isnull().sum()

In [None]:
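# Assumption: X_encoded is X with its categorical columns label-encoded, and ordered_feature
# is a chi-square SelectKBest fitted on it (chi2 needs non-negative features).
X_encoded = X.apply(lambda s: s.astype('category').cat.codes if s.dtype == 'object' else s)
ordered_feature = SelectKBest(score_func=chi2, k='all').fit(X_encoded, y)

# Create a DataFrame to hold feature scores
dfscores = pd.DataFrame(ordered_feature.scores_, columns=["Score"])

In [None]:
# Create a DataFrame to hold feature names
dfcolumns = pd.DataFrame(X_encoded.columns, columns=["Features"])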

In [None]:
# Concatenate DataFrames to create a ranked feature list
features_rank = pd.concat([dfcolumns, dfscores], axis=1)

In [None]:
# Rename columns for clarity
features_rank.columns = ['Features', 'Score']

In [None]:
# Display the ranked features
features_rank

In [None]:
# Fit the selector on the encoded, non-negative features (chi2 rejects strings and negative values)
chi2_selector = SelectKBest(chi2, k=10)
X_kbest = chi2_selector.fit_transform(X_encoded, y)
chi2_scores = pd.DataFrame({'Feature': X_encoded.columns, 'Chi2 Score': chi2_selector.scores_})
chi2_scores = chi2_scores.sort_values(by='Chi2 Score', ascending=False)
print("Chi-square Test:")
print(chi2_scores)


# **Filter Methods**
- Filter methods evaluate the intrinsic properties of the features based on univariate statistics and do not involve any machine learning model. These methods are fast and computationally efficient, making them suitable for high-dimensional data.
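Because the price is a continuous target, one filter score that may fit this problem is mutual information for regression. The sketch below swaps in `mutual_info_regression` on synthetic data with one informative and one pure-noise feature. It is not the approach used in the cells of this notebook, and all names and values are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: the first feature drives the target, the second is pure noise
rng = np.random.default_rng(0)
informative = rng.normal(size=500)
noise = rng.normal(size=500)
X_demo = np.column_stack([informative, noise])
y_demo = 3 * informative + rng.normal(scale=0.1, size=500)

# Each feature is scored on its own against the target, without fitting a predictive model
mi = mutual_info_regression(X_demo, y_demo, random_state=0)

print(dict(zip(["informative", "noise"], mi.round(3))))
```

The informative feature receives a clearly higher score than the noise feature, which is the ranking a filter method is expected to produce.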

In [None]:
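from sklearn.feature_selection import mutual_info_classif

# Label-encode the categorical columns so that every feature is numeric.
# mutual_info_classif treats every distinct Price as a class label.
X_codes = X.apply(lambda s: s.astype('category').cat.codes if s.dtype == 'object' else s)
information_gain = mutual_info_classif(X_codes, y)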

In [None]:
information_gain

# **Fisher's Score:**

Ranks variables based on their Fisher score, which measures the separation between classes.
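One common way to write the Fisher score of feature $j$, given here as a sketch of the ratio of between-class to within-class scatter, is:

$$
F_j = \frac{\sum_{c} n_c \,(\mu_{j,c} - \mu_j)^2}{\sum_{c} n_c \,\sigma_{j,c}^2}
$$

Here $n_c$ is the number of samples in class $c$, $\mu_{j,c}$ and $\sigma_{j,c}^2$ are the mean and variance of feature $j$ within class $c$, and $\mu_j$ is the overall mean of feature $j$. A larger value means the class means are far apart relative to the spread inside each class.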

In [None]:
!pip install skfeature

In [None]:
import numpy as np

def fisher_score(X, y):
    classes = np.unique(y)
    fisher_scores = []
    for feature in X.T:
        numerator = 0
        denominator = 0
        overall_mean = np.mean(feature)
        for c in classes:
            class_feature = feature[y == c]
            class_mean = np.mean(class_feature)
            class_variance = np.var(class_feature)
            numerator += len(class_feature) * (class_mean - overall_mean) ** 2
            denominator += len(class_feature) * class_variance
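        # Guard against a zero within-class variance (e.g. every class has a single sample)
        fisher_scores.append(numerator / denominator if denominator > 0 else np.nan)
    return np.array(fisher_scores)

# The score needs numeric inputs, so only the numeric columns of X are used
X_numeric = X.select_dtypes(include='number')
fisher_scores = fisher_score(X_numeric.values, y.values)
fisher_score_df = pd.DataFrame({'Feature': X_numeric.columns, 'Fisher Score': fisher_scores})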
fisher_score_df = fisher_score_df.sort_values(by='Fisher Score', ascending=False)
print(fisher_score_df)

# **Correlation Coefficient:**

Measures the linear relationship between features and the target.
High correlation with the target but low correlation among features is desired.
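A minimal sketch of ranking features by their absolute correlation with the target, using a small hypothetical table (the values are invented and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical phones, used only to illustrate the ranking
demo = pd.DataFrame({
    "RAM": [2, 4, 6, 8, 12],
    "Memory": [32, 64, 128, 128, 256],
    "Height": [16.5, 15.2, 16.8, 15.9, 16.1],
    "Price": [6000, 9000, 14000, 17000, 30000],
})

# Absolute Pearson correlation of every feature with the target, strongest first
target_corr = demo.corr()["Price"].drop("Price").abs().sort_values(ascending=False)

# Correlation between two features: a high value signals redundancy
ram_memory_corr = demo["RAM"].corr(demo["Memory"])

print(target_corr.round(3))
print(round(ram_memory_corr, 3))
```

In this toy table both RAM and Memory track the price closely, but they also track each other, so keeping both adds little new information.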

# **Feature Importance**
- This technique gives a score to each feature of the data; the higher the score, the more relevant the feature is.
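Since the price is a continuous target, a regressor-based variant of this idea may be worth considering. The sketch below uses `ExtraTreesRegressor` on synthetic data. It is an illustrative alternative under that assumption, not the classifier-based cell of this notebook, and the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the target depends on RAM only; Noise is unrelated
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "RAM": rng.integers(2, 13, size=300),
    "Noise": rng.normal(size=300),
})
y_demo = 2000 * X_demo["RAM"] + rng.normal(scale=500, size=300)

model_demo = ExtraTreesRegressor(n_estimators=100, random_state=0)
model_demo.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; higher means the feature was more useful for splits
importances = pd.Series(model_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```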

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Assuming 'X' is your DataFrame with categorical features
label_encoders = {}
imputer = SimpleImputer(strategy='most_frequent')  # Create imputer once
X_imputed = imputer.fit_transform(X)  # Impute missing values for the entire DataFrame

for i, col in enumerate(X.columns):
    if X[col].dtype == 'object':
        le = LabelEncoder()
        X_imputed[:, i] = le.fit_transform(X_imputed[:, i])  # Encode the imputed column
        label_encoders[col] = le  # Store the encoder for later use

X = pd.DataFrame(X_imputed, columns=X.columns)  # Convert imputed array back to DataFrame

model = ExtraTreesClassifier()
model.fit(X, y)  # Now fit the model with the imputed and encoded DataFrame


In [None]:
print(model.feature_importances_)

In [None]:
ranked_features=pd.Series(model.feature_importances_,index=X.columns)
ranked_features.nlargest(10).plot(kind='barh')
plt.show()

In [None]:
data_encoded.corr()

In [None]:
import seaborn as sns
corr=num_cols.iloc[:,:-1].corr()
top_features=corr.index
plt.figure(figsize=(20,20))
sns.heatmap(num_cols[top_features].corr(),annot=True)

**Remove Correlated Features**

In [None]:
threshold=0.8

In [None]:
# find and remove correlated features
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
correlation(num_cols.iloc[:,:-1],threshold)

**Information Gain**

In [None]:
from sklearn.feature_selection import mutual_info_classif

In [None]:
mutual_info=mutual_info_classif(X,y)

In [None]:
mutual_data=pd.Series(mutual_info,index=X.columns)
mutual_data.sort_values(ascending=False)

**SelectKBest with f_regression**

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
#Convert 'Price' to numeric
data['Price'] = pd.to_numeric(data['Price'], errors='coerce')

# Extract numerical features, excluding 'Price' and the price-derived 'Price_per_GB'
numerical_features = data.select_dtypes(include=[np.number]).drop(columns=['Price', 'Price_per_GB'], errors='ignore')

# Impute missing values in numerical features using the mean strategy
imputer = SimpleImputer(strategy='mean')
numerical_features_imputed = imputer.fit_transform(numerical_features)

# Example: Impute missing values in 'Price' column (target variable)
imputer_price = SimpleImputer(strategy='mean')
price_imputed = imputer_price.fit_transform(data[['Price']]).ravel()

# Example: Perform SelectKBest feature selection using f_regression
selector = SelectKBest(score_func=f_regression, k='all')
selector.fit(numerical_features_imputed, price_imputed)

# Get scores and feature names
feature_scores = pd.DataFrame({'Feature': numerical_features.columns, 'Score': selector.scores_})
feature_scores = feature_scores.sort_values(by='Score', ascending=False)



#Print top features
top_features = feature_scores['Feature'].tolist()[:5]
top_features

# **Dimensionality Reduction**

## **t-Distributed Stochastic Neighbor Embedding (t-SNE)**

In [None]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer # Import SimpleImputer

# Assuming 'data' is your DataFrame containing mobile phone data
# Extract numerical features, excluding 'Price' and the price-derived 'Price_per_GB'
numerical_features = data.select_dtypes(include=[np.number]).drop(columns=['Price', 'Price_per_GB'], errors='ignore')

# Impute missing values using the mean strategy (or any other suitable strategy)
imputer = SimpleImputer(strategy='mean') # Create an imputer instance
numerical_features_imputed = imputer.fit_transform(numerical_features) # Impute NaNs

# Standardize the features (mean=0 and variance=1)
scaler = StandardScaler()
numerical_features_scaled = scaler.fit_transform(numerical_features_imputed) # Scale imputed data

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=0)
tsne_components = tsne.fit_transform(numerical_features_scaled)

# Create a DataFrame for the t-SNE components
tsne_df = pd.DataFrame(data=tsne_components, columns=['Component 1', 'Component 2'])

# Plotting t-SNE results
plt.figure(figsize=(10, 6))
plt.scatter(tsne_df['Component 1'], tsne_df['Component 2'])
plt.title('t-SNE Visualization')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid(True)
plt.show()