In [33]:
import pandas as pd

# read the dataset
url = 'https://raw.githubusercontent.com/jaichandm/personal/main/laptop_data.csv'
laptop_df = pd.read_csv(url)

# get dataset size
print(f'Initial Dataset:\trows= {laptop_df.shape[0]}\tcolumns= {laptop_df.shape[1]}\n')

# display the first 5 rows of the dataset
print(f'{laptop_df.head()}\n')

# check for missing values
print(f'missing values:\n{laptop_df.isnull().sum()}\n')

print('Number of Duplicate entries: ',laptop_df.duplicated().sum())


Initial Dataset:	rows= 1303	columns= 12

   Unnamed: 0 Company   TypeName  Inches                    ScreenResolution  \
0           0   Apple  Ultrabook    13.3  IPS Panel Retina Display 2560x1600   
1           1   Apple  Ultrabook    13.3                            1440x900   
2           2      HP   Notebook    15.6                   Full HD 1920x1080   
3           3   Apple  Ultrabook    15.4  IPS Panel Retina Display 2880x1800   
4           4   Apple  Ultrabook    13.3  IPS Panel Retina Display 2560x1600   

                          Cpu   Ram               Memory  \
0        Intel Core i5 2.3GHz   8GB            128GB SSD   
1        Intel Core i5 1.8GHz   8GB  128GB Flash Storage   
2  Intel Core i5 7200U 2.5GHz   8GB            256GB SSD   
3        Intel Core i7 2.7GHz  16GB            512GB SSD   
4        Intel Core i5 3.1GHz   8GB            256GB SSD   

                            Gpu  OpSys  Weight        Price  
0  Intel Iris Plus Graphics 640  macOS  1.37kg   71378.

### Did you discover interesting relations?

We pulled in the "Laptop Prices" dataset from an online source using the pandas library. After loading it, we checked the first 5 rows to get an idea of what the data looks like. We also checked for missing and duplicate values and found none. Note that 'Unnamed: 0' appears to be a running row index, so it makes every row distinct and the duplicate check is only meaningful once that column is excluded. We do not need the 'Unnamed: 0' column, so we can remove it.

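Based on the columns available, we expect laptops with higher-end specifications, such as faster processors, more RAM, larger storage capacities, and dedicated graphics cards, to have higher prices. Columns such as 'Ram' and 'Weight' are stored as text ('8GB', '1.37kg'), so they need to be converted to numbers before such relations can be measured.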
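One way to check such a relation is to convert the text columns to numbers and look at their correlation with 'Price'. The sketch below is only an illustration: the helper `add_numeric_specs`, the new column names `Ram_GB` and `Weight_kg`, and the toy rows (including their prices) are made up for the example and are not taken from the dataset. It also assumes every 'Ram' value ends in 'GB' and every 'Weight' value ends in 'kg', which may not hold for every row of the real file.

```python
import pandas as pd

def add_numeric_specs(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric versions of the 'Ram' and 'Weight' columns.

    Assumes values look like '8GB' and '1.37kg' (hypothetical helper).
    """
    out = df.copy()
    out["Ram_GB"] = out["Ram"].str.replace("GB", "", regex=False).astype(int)
    out["Weight_kg"] = out["Weight"].str.replace("kg", "", regex=False).astype(float)
    return out

# Toy rows with made-up prices, used only to demonstrate the conversion.
toy = pd.DataFrame({
    "Ram": ["4GB", "8GB", "16GB"],
    "Weight": ["2.1kg", "1.37kg", "1.83kg"],
    "Price": [30000.0, 60000.0, 120000.0],
})

toy_numeric = add_numeric_specs(toy)

# Pearson correlation between RAM size and price for the toy rows.
ram_price_corr = toy_numeric["Ram_GB"].corr(toy_numeric["Price"])
print(toy_numeric[["Ram_GB", "Weight_kg", "Price"]].corr())
```

On the real data, the same helper could be applied to `laptop_df` in place of `toy`, provided the suffix assumption holds.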


### What feature/s would you like to be able to predict?

We would like to be able to predict the price of a laptop based on its specifications and features such as CPU, RAM, screen size, storage, brand, etc.
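In practice, predicting the price means treating 'Price' as the target and the remaining columns as features. A minimal sketch of that separation is shown here, assuming the target column is named 'Price' as in the dataset; the helper `split_features_target` and the toy rows are hypothetical and only illustrate the shapes involved.

```python
import pandas as pd

def split_features_target(df: pd.DataFrame, target: str = "Price"):
    """Return (X, y): all columns except `target`, and the `target` column."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Toy rows with made-up values, not taken from the dataset.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Ram": ["4GB", "8GB", "16GB"],
    "Price": [30000.0, 60000.0, 120000.0],
})

X, y = split_features_target(toy)
print(list(X.columns))
print(y.name)
```

Text features such as 'Company' or 'Cpu' would still need to be encoded numerically before most models can use them.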

In [34]:
# drop any rows with missing values
laptop_df = laptop_df.dropna()

# drop any duplicate rows
laptop_df = laptop_df.drop_duplicates()

# drop 'Unnamed: 0' column
laptop_df = laptop_df.drop(columns=['Unnamed: 0'])

print(f'After cleaning Dataset:\trows= {laptop_df.shape[0]}\tcolumns= {laptop_df.shape[1]}\n')


After cleaning Dataset:	rows= 1303	columns= 11



In [36]:
from sklearn.model_selection import train_test_split

# remove rows with missing values
data_cleaned = laptop_df.dropna()

# remove any columns that still contain missing values
# (chained on data_cleaned so the row-wise step above is not discarded)
data_cleaned = data_cleaned.dropna(axis=1)

# split the dataset into training and testing sets
data_cleaned_train, data_cleaned_test = train_test_split(data_cleaned, test_size=0.2, random_state=123)


# get dataset size
train_num_rows, train_num_cols = data_cleaned_train.shape
test_num_rows, test_num_cols = data_cleaned_test.shape
print("Training Dataset")
print("Number of rows: ", train_num_rows)
print("Number of columns: ", train_num_cols)

print("Test Dataset")
print("Number of rows: ", test_num_rows)
print("Number of columns: ", test_num_cols)

Training Dataset
Number of rows:  1042
Number of columns:  11
Test Dataset
Number of rows:  261
Number of columns:  11
