**Laptop Price Prediction Using Linear Regression**
*Introduction*
This project aims to build a linear regression model to predict laptop prices based on various features such as brand, processor, RAM, hard drive type, screen size, and others. The dataset used in this project is sourced from Kaggle and includes a wide range of information on laptop specifications and their prices.

*Data Source*
The dataset for this project is sourced from the following link:
https://www.kaggle.com/datasets/muhammetvarl/laptop-price/discussion/361977.
This dataset contains detailed specifications and pricing information for laptops, making it an ideal candidate for regression analysis.

*Project Objective*
The primary objective of this project is to understand how various laptop features influence their prices and to build an effective linear regression model that can be used to predict the prices of laptops based on their specifications. This model could be beneficial for consumers seeking pricing guidance and for vendors in setting competitive prices for their products.

*Methodology*
This project follows a systematic approach to build and evaluate a linear regression model for predicting laptop prices. The methodology includes several key steps, as outlined below:

* Data Cleaning
- Handling Missing Data: Identify and address missing values in the dataset. Missing data can significantly affect regression analysis. Options for handling missing data include imputation (filling in missing values with statistical measures like mean or median) or removal of records with missing values.
- Outlier Detection and Removal: Identify and eliminate outliers, which are data points significantly different from others. Outliers can distort the analysis and model performance. Techniques such as Z-score and IQR (Interquartile Range) can be used for outlier detection.
* Feature Encoding
- Categorical Variables: Categorical variables cannot be directly used in regression models. Encoding techniques, such as one-hot encoding, are applied to transform these variables into a numerical format. This process involves creating additional binary (0 or 1) columns for each category of the variable.
* Variable Normalization
- Standardization: Ensure that numeric variables are standardized, which means they are scaled to have a mean of 0 and a standard deviation of 1. Standardization helps in comparing the influence of different variables on the model and can improve the convergence of gradient-based optimization algorithms used in model training.
* Predictions Using Different Models
- Model Exploration: Besides linear regression, explore other regression models to compare performances. Common alternatives include Ridge Regression, Lasso Regression, and Polynomial Regression. This comparative analysis can help in identifying the most suitable model for predicting laptop prices based on the given features.
* Model Evaluation
- Use appropriate metrics to evaluate the performance of the regression models. Common metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² (Coefficient of Determination). These metrics provide insights into the accuracy and reliability of the predictions.

**Reading Data**

In [55]:
import pandas as pd

df = pd.read_csv('laptop_price.csv', encoding='latin-1')
df.head()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_euros
0,1,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,1339.69
1,2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94
2,3,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,575.0
3,4,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,2537.45
4,5,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,1803.6


**Preprocessing data**


In [56]:
# When analyzing data, we encounter a few rare companies. Training a model with such a limited dataset could be impossible, so I decided to create a category 'Others' that encompasses all these rare companies.

brand_counts = df['Company'].value_counts()
small_brands = brand_counts[brand_counts < 10].index
df['Company'] = df['Company'].apply(lambda x: "Others" if x in small_brands else x)

In [57]:
# In the dataset, there are a few Operating Systems that can be categorized as more common OSes. To streamline this, I decided to unify certain Operating Systems by replacing specific versions with their more general counterparts.

df['OpSys'] = df['OpSys'].replace({'Windows 10 S': 'Windows 10', 'Mac OS X': 'macOS'})

# In the dataset, some Operating Systems are less common. To improve predictions, I categorized these rare Operating Systems under a unified label named 'OtherSys'. This was accomplished by replacing occurrences of 'Chrome OS' and 'Android' in the 'OpSys' column of the dataframe df with 'OtherSys'.

df['OpSys'] = df['OpSys'].replace(['Chrome OS', 'Android'], 'OtherSys')

In [58]:
# The 'Memory' column contains both the value and type of memory. I created a new column named 'Memory_Type' and extracted the type of memory using regex. This provided me with a column containing valuable information regarding the type of memory in each device, whether it's SSD, HDD, or Flash Storage.

df['Memory_Type'] = df['Memory'].str.extract(r'(SSD|HDD|Flash Storage)', expand=False)

In [59]:
# The 'Memory' column contains values specified in GB and TB. To standardize these values, I converted all of them to GB and removed the units. This process resulted in numeric values all in a single unit of measurement, GB, making the data more consistent and easier to analyze.

import re
def convert_to_gb(value):
    unit_map = {'TB': 1024, 'GB': 1}
    match = re.search(r'(\d+)(TB|GB)', value, re.IGNORECASE)
    if match:
        num, unit = match.groups()
        return float(num) * unit_map[unit.upper()]
    else:
        return None

df['Memory_size_GB'] = df['Memory'].apply(convert_to_gb)

In [60]:
# The 'Cpu' column contains information about both the processor model and its clock frequency. To better organize this data, I split it into two separate columns. Using a lambda function, I separated the processor model and clock frequency at the last space character, resulting in two new columns: 'Processor' for the processor model and 'Clock' for the clock frequency.

df[['Processor', 'Clock']] = df['Cpu'].apply(lambda x: pd.Series(str(x).rsplit(' ', 1)))

In [61]:
# To streamline the processor information, I implemented a function called trim_processor to remove less relevant parts of the description. This function checks if the processor description starts with 'Intel'; if so, it retains only the first three words of the description. For other processors, it keeps the first two words.

def trim_processor(value):
    if value.startswith('Intel'):
        return ' '.join(value.split(' ', 3)[:3])
    else:
        return ' '.join(value.split(' ', 2)[:2])

df['Processor'] = df['Processor'].apply(trim_processor)

In [0]:
# To provide better learning models I ograniczyłem number of Processors and changed rare processors to Others

cpu_counts = df['Processor'].value_counts()
rare_cpu = cpu_counts[cpu_counts < 10].index
df['Processor'] = df['Processor'].apply(lambda x: "Others" if x in rare_cpu else x)

In [0]:
# In the columns 'Clock', 'Ram', and 'Weight', the units are consistent, so I decided to remove the units to obtain numeric values. This was achieved by applying transformation functions to each column.

df['Clock'] = df['Clock'].apply(lambda x: float(str(x).strip("GHz")))
df['Ram'] = df["Ram"].apply(lambda x: int(str(x).strip("GB")))
df['Weight'] = df["Weight"].apply(lambda x: float(str(x).strip("kg")))

In [0]:
# To disregard non-essential information, I opted to eliminate all detailed GPU information and retain only the brand. This was accomplished by creating a new column 'Gpu_Brand' in the dataframe df. I extracted the brand from the 'Gpu' column by splitting each entry at the first space and keeping only the first part, which typically represents the brand name. This approach simplifies the GPU data to focus solely on the brand, which might be sufficient for modeling purposes.

df['Gpu_Brand'] = df['Gpu'].str.split(n=1, expand=True)[0]

In [0]:
# To obtain a numeric value and simplify the screen resolution information, I decided to limit the 'Screen Resolution' data to just the 'Screen Width'. This was done by creating a new column named 'Screen_Width' in the dataframe `df'. I extracted the width from the 'ScreenResolution' column by splitting each entry from the right at the last space, taking the last part which typically contains the resolution (e.g., '1920x1080'), and then splitting this by 'x' to separate width from height. I kept only the width (the first part) and converted it to an integer. This method effectively reduces the screen resolution data to a single, numeric width value, making it more straightforward for numerical analysis or modeling.

df['Screen_Width'] = df['ScreenResolution'].str.rsplit(' ', n=1).str[-1].str.split('x').str[0].astype(int)

In [0]:
# After extracting the relevant data, I remove unnecessary columns from the dataframe to streamline it.
df = df.drop(['laptop_ID', 'Product', 'ScreenResolution', 'Cpu', 'Memory', 'Gpu'], axis=1)

In [62]:
# Now we can see how DataFrame looks
df.head()

Unnamed: 0,Company,TypeName,Inches,Ram,OpSys,Weight,Price_euros,Memory_Type,Memory_size_GB,Processor,Clock,Gpu_Brand,Screen_Width
0,Apple,Ultrabook,13.3,8,macOS,1.37,1339.69,SSD,128.0,Intel Core i5,2.3,Intel,2560
1,Apple,Ultrabook,13.3,8,macOS,1.34,898.94,Flash Storage,128.0,Intel Core i5,1.8,Intel,1440
2,HP,Notebook,15.6,8,No OS,1.86,575.0,SSD,256.0,Intel Core i5,2.5,Intel,1920
3,Apple,Ultrabook,15.4,16,macOS,1.83,2537.45,SSD,512.0,Intel Core i7,2.7,AMD,2880
4,Apple,Ultrabook,13.3,8,macOS,1.37,1803.6,SSD,256.0,Intel Core i5,3.1,Intel,2560


**Cleaning data**

In [63]:
any_nan = df.isnull().values.any()
any_nan

True

In [64]:
df.dropna()
df.drop_duplicates(keep='first', inplace=True)

def remove_outliers(df, columns):
    indices_to_remove = set()

    for column in columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers_indices = df[(df[column] < lower_bound) | (df[column] > upper_bound)].index
        indices_to_remove.update(outliers_indices)

    df_cleaned = df.drop(indices_to_remove)

    return df_cleaned
df = remove_outliers(df, ['Price_euros', "Inches", 'Ram'])

**Splitting data**

In [65]:
from sklearn.model_selection import train_test_split

X = df.drop('Price_euros', axis=1)
y = df['Price_euros']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

**One Hot Encoding**

In [66]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
columns_to_encode=['Company','TypeName', 'OpSys', 'Memory_Type', 'Gpu_Brand', "Processor"]
encoder.fit(X_train[columns_to_encode])

X_train_encoded = encoder.transform(X_train[columns_to_encode])
X_test_encoded = encoder.transform(X_test[columns_to_encode])

X_train_encoded_df = pd.DataFrame(X_train_encoded.toarray(), columns=encoder.get_feature_names_out(columns_to_encode), index=X_train.index)
X_test_encoded_df = pd.DataFrame(X_test_encoded.toarray(), columns=encoder.get_feature_names_out(columns_to_encode), index=X_test.index)

X_train_dropped = X_train.drop(columns=columns_to_encode)
X_test_dropped = X_test.drop(columns=columns_to_encode)

X_train_final = pd.concat([X_train_dropped, X_train_encoded_df], axis=1)
X_test_final = pd.concat([X_test_dropped, X_test_encoded_df], axis=1)

In [67]:
X_train_final.dtypes

Inches                          float64
Ram                               int64
Weight                          float64
Memory_size_GB                  float64
Clock                           float64
Screen_Width                      int32
Company_Acer                    float64
Company_Apple                   float64
Company_Asus                    float64
Company_Dell                    float64
Company_HP                      float64
Company_Lenovo                  float64
Company_MSI                     float64
Company_Others                  float64
Company_Toshiba                 float64
TypeName_2 in 1 Convertible     float64
TypeName_Gaming                 float64
TypeName_Netbook                float64
TypeName_Notebook               float64
TypeName_Ultrabook              float64
TypeName_Workstation            float64
OpSys_Linux                     float64
OpSys_No OS                     float64
OpSys_OtherSys                  float64
OpSys_Windows 10                float64


**Normalization**

In [68]:
mean_values = X_train_final.mean()
std_values = X_train_final.std()

print("Średnie wartości dla każdej kolumny:\n", mean_values)
print("\nOdchylenia standardowe dla każdej kolumny:\n", std_values)

Średnie wartości dla każdej kolumny:
 Inches                            15.001724
Ram                                6.674877
Weight                             1.960293
Memory_size_GB                   451.256158
Clock                              2.283374
Screen_Width                    1837.995074
Company_Acer                       0.092365
Company_Apple                      0.014778
Company_Asus                       0.107143
Company_Dell                       0.229064
Company_HP                         0.240148
Company_Lenovo                     0.230296
Company_MSI                        0.019704
Company_Others                     0.032020
Company_Toshiba                    0.034483
TypeName_2 in 1 Convertible        0.083744
TypeName_Gaming                    0.086207
TypeName_Netbook                   0.006158
TypeName_Notebook                  0.655172
TypeName_Ultrabook                 0.147783
TypeName_Workstation               0.020936
OpSys_Linux                        0.0

In [69]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train_final)
X_train_scaled = scaler.transform(X_train_final)
X_test_scaled = scaler.transform(X_test_final)

X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_final.columns, index=X_train_final.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_final.columns, index=X_test_final.index)


**Regression**

In [70]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled_df, y_train)
y_pred_linear = linear_model.predict(X_test_scaled_df)
print(f'Coefficients: {linear_model.coef_}')

Coefficients: [ 2.53551692e+01  6.91430247e+01 -7.23464264e+01  1.45730037e+01
  6.17681853e+01  8.04369991e+01  4.15102585e+14  8.51346479e+14
  4.43424507e+14  6.02468993e+14  6.12422234e+14  6.03603673e+14
  1.99254247e+14  2.52399801e+14  2.61594199e+14 -7.35699261e+13
 -7.45436010e+13 -2.07769199e+13 -1.26239198e+14 -9.42547110e+13
 -3.80248422e+13  1.75656671e+15  1.84984163e+15  9.46461315e+14
  3.02020722e+15  1.52612301e+15  2.68107098e+14  9.64756183e+14
  2.21417104e+15  2.28385712e+15  4.01401144e+14  1.44360856e+15
  1.40752078e+14  1.96217460e+15  1.71736568e+15  1.22305144e+15
  1.49419515e+15  2.40931728e+15  1.43147677e+15  4.30181611e+15
  6.05755939e+15  5.75345981e+15  1.82313240e+15  2.64785230e+15]


In [71]:
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train_scaled_df, y_train)
y_pred_lasso = lasso_model.predict(X_test_scaled_df)
print(f'Coefficients: {lasso_model.coef_}')


Coefficients: [  11.04393841   64.88565274  -62.6023335    11.42376996   58.36437738
   80.26505848  -34.49285878    5.97631056  -27.69200985   -0.
   10.48684129    9.12558213    0.          -13.18110783   18.74475616
    0.19120967  -22.26748764   19.23763918 -117.39528229    0.
   69.0785065   -22.55947256  -55.13410859   -0.69380485    0.
   47.79828486    1.56996263   -4.42506118   -0.           67.97430106
   -6.15864847  -28.41078584   -4.21080377   -0.            3.34221934
  -23.35823486  -26.47348229  -20.48716014   13.50576415  -27.66960515
   43.35170476   81.80485506   -0.          -33.27969906]


In [72]:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled_df, y_train)
y_pred_ridge = ridge_model.predict(X_test_scaled_df)
print(f'Coefficients: {ridge_model.coef_}')


Coefficients: [ 21.75121031  63.13446897 -71.66662684  14.3134738   59.36587561
  80.19820485 -31.57454365   6.58929007 -25.66379001   3.89638158
  14.96204929  14.9426575    1.09240731 -12.11181348  21.62165964
  27.6751212    3.48343924  28.16371581 -73.2567249   34.10455652
  82.90567222 -16.95768717 -50.61885585   2.45236953  10.61577321
  54.25824721   6.58929007 -16.96525375 -30.48542237  38.83817329
 -12.04173429 -23.24262259  -4.24040602   7.29622662  11.548865
 -29.35689272 -33.60932555 -29.92908238   9.23328713 -45.23560264
  20.7200658   59.79505005  -7.49409999 -43.91036663]


from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=5)
X_train_poly = poly_features.fit_transform(X_train_scaled_df)
X_test_poly = poly_features.transform(X_test_scaled_df)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)
print(f'Coefficients: {poly_model.coef_}')



In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_model = DecisionTreeRegressor(max_depth=3, min_samples_split=6)
tree_model.fit(X_train_scaled_df, y_train)
y_pred_tree = tree_model.predict(X_test_scaled_df)


In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20,10))
plot_tree(tree_model, filled=True, feature_names=X_test_scaled_df.columns, max_depth=3, fontsize=10)
plt.show()


**Evaluation**

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
model_names = ['Linear', 'Lasso', 'Ridge', 'Polynomial', 'Decision Tree']
predictions = [y_pred_linear, y_pred_lasso, y_pred_ridge, y_pred_poly, y_pred_tree]
results = []

for model_name, pred in zip(model_names, predictions):
    mse = mean_squared_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    results.append({"Model": model_name, "MSE": mse, "R2": r2})

results_df = pd.DataFrame(results)

In [None]:
results_df


In our analysis, the Ridge, Lasso, and Linear Regression models demonstrated the best performance and achieved comparable results. Among these, the Ridge Regression model emerged as the top performer, indicating its effectiveness in handling the dataset and predicting laptop prices with a high level of accuracy. On the other hand, the Polynomial Regression model did not perform as well, showing significantly lower effectiveness in this particular case. This outcome suggests that while Polynomial Regression can be powerful for capturing non-linear relationships, it may not always be the most suitable choice, especially if the data does not exhibit strong non-linear patterns or if the model becomes too complex and overfits the data.

In [None]:
import matplotlib.pyplot as plt
residuals = y_test - y_pred_poly
plt.scatter(y_pred_poly, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Przewidywane wartości')
plt.ylabel('Reszty')
plt.title('Wykres reszt')
plt.show()