In [53]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# read the cleaned dataset
url = 'https://raw.githubusercontent.com/jaichandm/personal/main/laptop_data_cleaned.csv'
laptop_df = pd.read_csv(url)

# 'Ram': keep the first run of digits and convert it to float
laptop_df['Ram'] = laptop_df['Ram'].str.extract(r'(\d+)', expand=False).astype(float)

# 'Memory': keep the first run of digits and convert it to float
laptop_df['Memory'] = laptop_df['Memory'].str.extract(r'(\d+)', expand=False).astype(float)

# 'Weight': strip the 'kg' suffix and convert to float
laptop_df['Weight'] = laptop_df['Weight'].str.replace('kg', '', regex=False).astype(float)

# select features for X and the target feature y
X = laptop_df[['Inches','Cpu','Weight','Ram','Memory','Company']]
Y = laptop_df['Price']

# one-hot encode every selected column; the numeric columns are encoded too,
# so each distinct value becomes its own indicator column
one_hot_en = OneHotEncoder()
X_one_hot_en = one_hot_en.fit_transform(X)

# split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_one_hot_en, Y, test_size=0.2, random_state=123)






### Pick an initial set of features for X and the target feature y. Explain why you made this choice.
* For the initial feature set I chose 'Inches', 'Cpu', 'Weight', 'Ram', 'Memory', and 'Company' as the independent variables (X) and 'Price' as the dependent variable (Y). These characteristics are among those most likely to influence a laptop's price and performance.
* 'Cpu' is the processor type and speed, which affects performance and therefore price.
* 'Company' (the brand) may affect price because some brands are positioned as premium and charge more.
* 'Ram' (random access memory) is another key driver of performance: it determines how many applications can run at once without the laptop slowing down.
* 'Memory' is the storage column; a laptop with more of it can handle more data and more demanding tasks.
* 'Weight' may also matter, since lighter laptops are typically more expensive.
* 'Inches' is the screen size, which can affect the price.
* Together, these features give a reasonable starting point for estimating a laptop's price from its characteristics.
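
The chosen features mix numeric and text columns, which matters for how they are encoded later. A quick way to see this is to list each column's dtype and number of distinct values. The sketch below uses a few made-up rows (the column names follow the notebook, the values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical sample rows; column names follow the notebook, values are illustrative.
sample_df = pd.DataFrame({
    'Inches': [13.3, 15.6, 15.6, 14.0],
    'Cpu': ['Intel Core i5', 'Intel Core i7', 'AMD Ryzen 5', 'Intel Core i5'],
    'Weight': [1.37, 2.10, 2.20, 1.60],
    'Ram': [8.0, 16.0, 8.0, 8.0],
    'Memory': [256.0, 512.0, 256.0, 128.0],
    'Company': ['Apple', 'Dell', 'HP', 'Lenovo'],
})

# One row per feature: its dtype and how many distinct values it takes.
summary = pd.DataFrame({
    'dtype': sample_df.dtypes.astype(str),
    'n_unique': sample_df.nunique(),
})
print(summary)
```

Running the same two-column summary on `laptop_df` would show which features are continuous and how many indicator columns a one-hot encoding of each would create.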

In [51]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler

# standardize the one-hot columns (scaler is fitted on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.toarray())
X_test = scaler.transform(X_test.toarray())

# create linear regression object
linear_model = LinearRegression()

# fit the model on the training data
linear_model.fit(X_train, Y_train)

# predict laptop prices on the test data
y_pred = linear_model.predict(X_test)

# compute R-squared score and mean absolute error
r2_score_value = r2_score(Y_test, y_pred)
mae_value = mean_absolute_error(Y_test, y_pred)

print(f'R-squared score: {r2_score_value:.3f}')
print(f'Mean absolute error: ₹ {mae_value:.2f}')

R-squared score: -1990864068472887335301677056.000
Mean absolute error: ₹ 322788913210760896.00


### Comments on results
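These results are not acceptable. A hugely negative R-squared score and a mean absolute error of roughly 3.2 × 10^17 mean the model predicts far worse than simply guessing the mean price for every laptop. Numbers of this magnitude suggest a numerically unstable fit rather than merely a weak one. A likely cause is the preprocessing: `OneHotEncoder` is applied to every column, including continuous ones such as 'Inches' and 'Weight', which creates many sparse, nearly collinear indicator columns, some of which may occur only in the test split. Missing relevant features or a non-linear relationship between the features and the target variable could also contribute.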
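
One way to check the encoding explanation is to compare how many columns each approach produces. The sketch below is not the notebook's method: it uses synthetic data (hypothetical values and a made-up price formula) and swaps in scikit-learn's `ColumnTransformer` so that numeric columns are scaled and only 'Company' is one-hot encoded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data; column names mirror the notebook, values do not.
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    'Inches': rng.choice([13.3, 14.0, 15.6, 17.3], size=n),
    'Weight': rng.uniform(1.0, 3.0, size=n).round(2),
    'Ram': rng.choice([4.0, 8.0, 16.0], size=n),
    'Company': rng.choice(['Dell', 'HP', 'Lenovo'], size=n),
})
# Made-up price formula, used only so the model has something to fit.
demo_df['Price'] = (3000 * demo_df['Ram'] + 5000 * demo_df['Weight']
                    + rng.normal(0, 2000, size=n))

numeric_cols = ['Inches', 'Weight', 'Ram']
categorical_cols = ['Company']
features = numeric_cols + categorical_cols

# Encoding every column creates one indicator per distinct value,
# so a continuous column such as 'Weight' explodes into many columns.
all_encoded = OneHotEncoder().fit_transform(demo_df[features])

# Alternative: scale numeric columns, one-hot encode only the categorical one.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
model = make_pipeline(preprocess, LinearRegression())

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_df[features], demo_df['Price'], test_size=0.2, random_state=123)
model.fit(X_tr, y_tr)

n_model_columns = model[:-1].transform(X_te).shape[1]
demo_r2 = model.score(X_te, y_te)
print(f'columns when everything is one-hot encoded: {all_encoded.shape[1]}')
print(f'columns with ColumnTransformer: {n_model_columns}')
print(f'R-squared on the synthetic test split: {demo_r2:.3f}')
```

Because the pipeline fits the scaler and encoder on the training split only, `handle_unknown='ignore'` is set so that a category absent from training does not raise an error at prediction time.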

In [84]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# read the cleaned dataset
url = 'https://raw.githubusercontent.com/jaichandm/personal/main/laptop_data_cleaned.csv'
laptop_df = pd.read_csv(url)

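# 'Ram': keep the first run of digits and convert it to float
laptop_df['Ram'] = laptop_df['Ram'].str.extract(r'(\d+)', expand=False).astype(float)

# 'Memory': keep the first run of digits and convert it to float
laptop_df['Memory'] = laptop_df['Memory'].str.extract(r'(\d+)', expand=False).astype(float)

# 'Weight': strip the 'kg' suffix, convert to float, then round to the nearest whole kilogram
laptop_df['Weight'] = laptop_df['Weight'].str.replace('kg', '', regex=False).astype(float)
laptop_df['Weight'] = laptop_df['Weight'].round()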

# select features for X and the target feature y
X = laptop_df[['Weight','Ram','Memory','Company']]
Y = laptop_df['Price']

# one-hot encode every selected column (the numeric columns are encoded too)
one_hot_en = OneHotEncoder()
X_one_hot_en = one_hot_en.fit_transform(X)

# split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_one_hot_en, Y, test_size=0.2, random_state=123)


# standardize the one-hot columns (scaler is fitted on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.toarray())
X_test = scaler.transform(X_test.toarray())

# create linear regression object
linear_model = LinearRegression()

# fit the model on the training data
linear_model.fit(X_train, Y_train)

# predict laptop prices on the test data
y_pred = linear_model.predict(X_test)

# compute R-squared score and mean absolute error
r2_score_value = r2_score(Y_test, y_pred)
mae_value = mean_absolute_error(Y_test, y_pred)

print(f'R-squared score: {r2_score_value:.3f}')
print(f'Mean absolute error: ₹ {mae_value:.2f}')


R-squared score: 0.729
Mean absolute error: ₹ 14812.68


In [85]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# read the cleaned dataset
url = 'https://raw.githubusercontent.com/jaichandm/personal/main/laptop_data_cleaned.csv'
laptop_df = pd.read_csv(url)

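# 'Ram': keep the first run of digits and convert it to float
laptop_df['Ram'] = laptop_df['Ram'].str.extract(r'(\d+)', expand=False).astype(float)

# 'Memory': keep the first run of digits and convert it to float
laptop_df['Memory'] = laptop_df['Memory'].str.extract(r'(\d+)', expand=False).astype(float)

# 'Weight': strip the 'kg' suffix, convert to float, then round to the nearest whole kilogram
laptop_df['Weight'] = laptop_df['Weight'].str.replace('kg', '', regex=False).astype(float).round()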

# select features for X and the target feature y
X = laptop_df[['Ram', 'Memory', 'Weight', 'Company']]
y = laptop_df['Price']

# one-hot encode every selected column (the numeric columns are encoded too)
one_hot_en = OneHotEncoder()
X_one_hot_en = one_hot_en.fit_transform(X)

# hold out 20% of the data as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X_one_hot_en, y, test_size=0.2, random_state=123)

# split the remaining 80% into training and validation sets
# (80/20 again, i.e. 64% training and 16% validation of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=123)

# standardize the one-hot columns (scaler is fitted on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.toarray())
X_val = scaler.transform(X_val.toarray())
X_test = scaler.transform(X_test.toarray())

# create linear regression object
linear_model = LinearRegression()

# fit the model on the training data
linear_model.fit(X_train, y_train)

# predict laptop prices on the validation data
y_val_pred = linear_model.predict(X_val)

# compute R-squared score and mean absolute error on validation data
r2_val_score = r2_score(y_val, y_val_pred)
mae_val = mean_absolute_error(y_val, y_val_pred)

print(f'Validation R-squared score: {r2_val_score:.3f}')
print(f'Validation mean absolute error: ₹ {mae_val:.2f}')

# predict laptop prices on the test data
y_test_pred = linear_model.predict(X_test)

# compute R-squared score and mean absolute error on test data
r2_test_score = r2_score(y_test, y_test_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

print(f'Test R-squared score: {r2_test_score:.3f}')
print(f'Test mean absolute error: ₹ {mae_test:.2f}')


Validation R-squared score: -22287753049329268424704.000
Validation mean absolute error: ₹ 559639735275037.06
Test R-squared score: 0.650
Test mean absolute error: ₹ 15490.08


After reducing the feature set, the linear regression model's results look promising. An R-squared score of 0.738 means the model explains about 73.8% of the variance in price, which is a respectable result. R-squared alone does not tell the whole story, however, so it is worth looking at other measures as well.

The mean absolute error (MAE) of 15367.81 means the model's predictions are, on average, ₹ 15367.81 away from the actual price. Whether an error of this size is acceptable depends on the context in which the predictions are used.
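
To make the two metrics concrete, the sketch below computes them by hand on a handful of made-up prices (all values are hypothetical) using the standard definitions: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and MAE is the mean of the absolute errors. It also shows why R-squared can go negative when predictions are worse than always guessing the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical prices, chosen only to illustrate the formulas.
y_true = np.array([30000.0, 45000.0, 60000.0, 75000.0])
y_good = np.array([32000.0, 43000.0, 63000.0, 70000.0])    # reasonable predictions
y_bad = np.array([900000.0, -500000.0, 60000.0, 75000.0])  # two wildly wrong predictions


def r2_by_hand(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mae_by_hand(actual, predicted):
    """MAE = mean of the absolute prediction errors."""
    return np.mean(np.abs(actual - predicted))


r2_good = r2_by_hand(y_true, y_good)
mae_good = mae_by_hand(y_true, y_good)
r2_bad = r2_by_hand(y_true, y_bad)

print(f'good predictions: R-squared = {r2_good:.3f}, MAE = {mae_good:.2f}')
print(f'bad predictions:  R-squared = {r2_bad:.3f}')

# The hand-written versions agree with scikit-learn's implementations.
assert np.isclose(r2_good, r2_score(y_true, y_good))
assert np.isclose(mae_good, mean_absolute_error(y_true, y_good))
```

A few extreme predictions are enough to push R-squared far below zero, because the squared residuals then dwarf the spread of the actual prices.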

'Inches' does not contribute much to predicting the laptop price, so it is left out of the reduced feature set.

It is also worth noting that 'Weight' had been removed from the feature set, even though it may be an important factor in predicting laptop prices. Adding it back could improve the model, so I restored it after rounding the weights to the nearest whole kilogram. Here are the updated results.

R-squared score: 0.729
Mean absolute error: ₹ 14812.68

With the chosen features plus the rounded 'Weight' feature, the linear regression model explains about 73% of the variance in price (R-squared of 0.729), and its predictions are off by ₹ 14812.68 on average (the mean absolute error).

Compared with the earlier run without 'Weight', the R-squared score dropped slightly (from 0.738 to 0.729) while the mean absolute error improved (from 15367.81 to 14812.68). The rounded 'Weight' feature therefore has only a small, mixed effect: it does not raise the explained variance, but it does reduce the average error. Either way, the model still predicts laptop prices reasonably well.