In [39]:
import pandas as pd
import numpy as np

### Inspecting unstructured data
- While inspecting the data, we find that some columns are not properly structured.
- They are neither numerical nor categorical.
- They still contain useful information, so we analyse them by mocking the preprocessing step.
- This is not the final preprocessing; we run a few trials here to get the most out of the data.

In [40]:
data_path = '../data_source/laptop_data.xlsx'
df = pd.read_excel(data_path)

In [41]:
df['Ram']

0        8GB
1        8GB
2        8GB
3       16GB
4        8GB
        ... 
1298     4GB
1299    16GB
1300     2GB
1301     6GB
1302     4GB
Name: Ram, Length: 1303, dtype: object

## Intuition Report: RAM Feature Preprocessing

### 1. Feature Overview
The `Ram` column represents the **memory capacity** of each laptop in gigabytes (GB). Initially, the data was stored as **string values** such as `"8GB"`, `"16GB"`, etc. Although human-readable, this representation is **not directly usable** for numerical analysis or modeling.

---

### 2. Problem with Original Format
- The presence of the `"GB"` text suffix makes the column **non-numeric**, which restricts:
  - Statistical computations (e.g., mean, correlation)
  - Visualization scaling (e.g., histograms, box plots)
  - Feature engineering (e.g., normalization, clustering)

Hence, even though `"8GB"` and `"16GB"` carry quantitative meaning, the system interprets them as plain text.

---

### 3. Preprocessing Intuition
To make this feature analytically useful:
1. **Text Cleaning** – Remove the `"GB"` suffix to isolate the numeric part.  
   Example: `"8GB"` → `"8"`
2. **Type Conversion** – Convert the cleaned text into an **integer datatype** (`int32`), enabling mathematical operations.

This ensures that the `Ram` feature can be interpreted quantitatively — representing **actual hardware capacity** rather than a string label.
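
As a minimal sketch of the two steps, here is the same idea on a toy Series. The values below are illustrative and only mirror the format of the `Ram` column; `ram_raw` and `ram_clean` are names made up for this example.

```python
import pandas as pd

# Toy values mirroring the "<number>GB" format (not the real dataset).
ram_raw = pd.Series(["8GB", "16GB", "4GB"], name="Ram")

# Step 1: strip the literal "GB" suffix. Step 2: cast the digits to int32.
ram_clean = ram_raw.str.replace("GB", "", regex=False).astype("int32")

print(ram_clean.tolist())  # [8, 16, 4]
print(ram_clean.dtype)     # int32
```

Passing `regex=False` treats `"GB"` as a plain substring, which is all this column appears to need.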

---

### 4. Analytical Benefit
Once converted to numeric form, the `Ram` column can now:
- Participate in **descriptive statistics** and **correlation analysis**.
- Be used in **model training**, contributing to performance prediction or price estimation.
- Serve as a **scalable and comparable metric** across all entries.

---

### 5. Intuitive Takeaway
In short, the transformation from `"8GB"` → `8` reflects a shift from **human-readable text** to **machine-understandable numeric format**, a key step in preparing structured data for meaningful analysis and modeling.


In [42]:
df['Ram'] = df['Ram'].str.replace('GB','').astype('int32')


In [43]:
df['Ram']


0        8
1        8
2        8
3       16
4        8
        ..
1298     4
1299    16
1300     2
1301     6
1302     4
Name: Ram, Length: 1303, dtype: int32

In [44]:
df['Weight'] 


0       1.37kg
1       1.34kg
2       1.86kg
3       1.83kg
4       1.37kg
         ...  
1298     1.8kg
1299     1.3kg
1300     1.5kg
1301    2.19kg
1302     2.2kg
Name: Weight, Length: 1303, dtype: object

## Intuition Report: Weight and Price Feature Preprocessing

### 1. Feature Overview
The dataset includes two continuous features:
- **`Weight`** → Indicates the physical mass of each laptop, initially recorded with a `'kg'` unit suffix.  
- **`Price`** → Represents the product’s cost, already stored as a numeric value.

Both attributes are **quantitative**, but their usability in analysis depends on proper data typing and formatting.

---

### 2. Problem with Original Weight Format
Initially, the `Weight` column contained values like `"1.37kg"`, `"1.8kg"`, etc.  
This textual representation prevents numerical operations such as:
- Calculating averages or weight distribution.
- Correlating with other numeric features (e.g., `Price`, `Inches`, `Ram`).
- Scaling and normalization for modeling.

The `'kg'` suffix introduces **non-numeric noise**, even though the value itself is numerical in nature.

---

### 3. Preprocessing Intuition
To prepare `Weight` for quantitative analysis:
1. **Remove the textual unit (`'kg'`)** — stripping non-numeric characters.
2. **Convert to a numeric datatype (`float32`)** — enabling mathematical operations.

This conversion allows the column to truly represent physical weight values on a continuous numeric scale.
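
A slightly more defensive variant, sketched on hypothetical values: `pd.to_numeric(..., errors="coerce")` turns anything that is still not a number into `NaN`, so malformed rows can be inspected before casting. The `"unknown"` entry is invented for illustration; we have not checked whether the real column contains such values.

```python
import pandas as pd

# Hypothetical values: two well-formed weights and one malformed entry.
weight_raw = pd.Series(["1.37kg", "2.2kg", "unknown"], name="Weight")

# Remove the unit, then coerce non-numeric leftovers to NaN instead of raising.
weight_num = pd.to_numeric(
    weight_raw.str.replace("kg", "", regex=False), errors="coerce"
)

# Rows that failed to parse can be reviewed before any cast.
bad_rows = weight_raw[weight_num.isna()]
print(bad_rows.tolist())  # ['unknown']

weight_clean = weight_num.astype("float32")
```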

---

### 4. Role of Price Feature
The `Price` column is already numeric but needs to be **validated** for consistency:
- Ensure there are **no string or currency symbols** (like “₹” or “USD”).
- Maintain it as a **float or int** type to support statistical and model-based analysis.
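
One way to run this validation, as a sketch on three prices taken from this notebook's `Price` output. The extra checks for missing or non-positive values are our own suggestion, not something the dataset is known to need.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Three sample prices from the notebook output.
price = pd.Series([71378.6832, 47895.5232, 30636.0], name="Price")

# A numeric dtype rules out stray strings or currency symbols.
print(is_numeric_dtype(price))  # True

# Missing or non-positive prices would also deserve a look.
n_missing = int(price.isna().sum())
n_non_positive = int((price <= 0).sum())
print(n_missing, n_non_positive)  # 0 0
```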

---

### 5. Analytical Value
Post preprocessing:
- `Weight` becomes a **continuous variable**, suitable for regression, trend, and correlation analysis.
- `Price` remains a **target or dependent variable**, which can be predicted or explained by other features like `Ram`, `Inches`, and `Weight`.

---

### 6. Intuitive Takeaway
The transformation from `"1.37kg"` → `1.37` is a matter of **unit stripping**: translating human-readable data into a **machine-understandable numeric scale**.  
Such preprocessing ensures both `Weight` and `Price` align with analytical workflows, making them ready for **visual exploration, feature scaling, and predictive modeling**.


In [45]:
df['Weight'] = df['Weight'].str.replace('kg','').astype('float32')

In [46]:
df['Weight']

0       1.37
1       1.34
2       1.86
3       1.83
4       1.37
        ... 
1298    1.80
1299    1.30
1300    1.50
1301    2.19
1302    2.20
Name: Weight, Length: 1303, dtype: float32

In [47]:
df['Price'] 

0        71378.6832
1        47895.5232
2        30636.0000
3       135195.3360
4        96095.8080
           ...     
1298     33992.6400
1299     79866.7200
1300     12201.1200
1301     40705.9200
1302     19660.3200
Name: Price, Length: 1303, dtype: float64

## Intuition Report: Price Feature Formatting

### 1. Feature Overview
The `Price` column captures the **monetary value** of each laptop.  
Initially, it was stored as a **floating-point (`float64`)** number, containing decimal precision such as `71378.6832`.  
However, listed retail prices are often quoted in **whole currency units**, so the fractional part adds little to this analysis.

---

### 2. Problem with Floating Representation
Although decimals provide precision, they can introduce **unnecessary granularity** when:
- Prices are inherently rounded to the nearest unit (e.g., ₹47,895 rather than ₹47,895.52).
- Models or reports require **consistent integer-based values**.
- Data visualization or aggregation demands **clean categorical grouping** by price ranges.

A `float64` column also uses 8 bytes per value, compared with 4 bytes for the `int32` result shown in this notebook, although the saving is negligible for 1,303 rows.

---

### 3. Preprocessing Intuition
To align the `Price` feature with practical analysis needs:
1. **Type Conversion** → Convert from `float64` to `int`.  
   This transformation truncates the fractional part (it does not round) and stores prices as whole numbers.
2. **Effect** → Each value changes by less than one currency unit, so the **true magnitude** of the data is maintained while downstream usage is simplified.
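
To make the difference between truncating and rounding concrete, here is a small sketch on three prices from this notebook's output. Rounding is shown only as an alternative for comparison; the notebook itself uses plain `astype(int)`.

```python
import pandas as pd

price = pd.Series([71378.6832, 47895.5232, 96095.8080])

truncated = price.astype(int)        # drops the fractional part
rounded = price.round().astype(int)  # rounds to the nearest unit first

print(truncated.tolist())  # [71378, 47895, 96095]
print(rounded.tolist())    # [71379, 47896, 96096]
```

Either choice moves each price by less than one unit, so the effect on the analysis should be small.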

---

### 4. Analytical Benefit
Post conversion, the `Price` feature:
- Becomes **simpler to interpret** and visualize.
- Aligns with the common convention of quoting prices in **whole currency units**.
- Supports **categorical binning**, **correlation**, and **machine learning regression** with minimal noise.

---

### 5. Intuitive Takeaway
The conversion from `71378.6832 → 71378` reflects a deliberate move from **over-precision to practical clarity**.  
It preserves interpretability and analytical stability, ensuring that `Price` behaves as a **clean numerical target variable** for trend, distribution, or predictive analysis.


In [48]:
df['Price'] = df['Price'].astype(int)

In [49]:
df['Price']

0        71378
1        47895
2        30636
3       135195
4        96095
         ...  
1298     33992
1299     79866
1300     12201
1301     40705
1302     19660
Name: Price, Length: 1303, dtype: int32

In [50]:
df['ScreenResolution']


0               IPS Panel Retina Display 2560x1600
1                                         1440x900
2                                Full HD 1920x1080
3               IPS Panel Retina Display 2880x1800
4               IPS Panel Retina Display 2560x1600
                           ...                    
1298     IPS Panel Full HD / Touchscreen 1920x1080
1299    IPS Panel Quad HD+ / Touchscreen 3200x1800
1300                                      1366x768
1301                                      1366x768
1302                                      1366x768
Name: ScreenResolution, Length: 1303, dtype: object

## Intuition Report: ScreenResolution Feature Engineering

### 1. Feature Overview
The `ScreenResolution` column describes the **display characteristics** of each laptop — combining textual and numerical information such as:
- Display technology (e.g., *IPS Panel*, *Retina Display*)
- Touch capability (e.g., *Touchscreen*)
- Resolution (e.g., *1920x1080*)

This feature contains **multiple attributes embedded in a single string**, which makes it **semi-structured** and not directly usable for quantitative analysis.

---

### 2. Problem with Original Format
As a single text field, `ScreenResolution` mixes descriptive and numeric data.  
This leads to challenges like:
- Difficulty in isolating **individual display properties**.
- Inability to use **resolution values** in mathematical computations.
- Textual redundancy that limits **pattern recognition** or **model interpretability**.

To extract meaningful signals, we must **decompose** the field into separate analytical components.

---

### 3. Preprocessing Intuition

#### a. Touchscreen Detection
- Create a new binary feature `Touchscreen`:
  - `1` → Laptop has a touchscreen display.
  - `0` → Non-touchscreen.
- Intuition: Converts descriptive text into a **machine-readable flag** representing interactivity.

#### b. IPS Panel Detection
- Create another binary feature `IPS`:
  - `1` → Laptop includes an IPS display.
  - `0` → Otherwise.
- Intuition: Captures an indicator of display **quality and colour accuracy**, transforming qualitative detail into a structured format.
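
Both flags can be sketched with a vectorised substring test. The three strings are sample values from the `ScreenResolution` output; `screen`, `touch` and `ips` are names used only in this example.

```python
import pandas as pd

screen = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "Full HD 1920x1080",
    "IPS Panel Full HD / Touchscreen 1920x1080",
])

# True/False per row, cast to 1/0 flags.
touch = screen.str.contains("Touchscreen", regex=False).astype(int)
ips = screen.str.contains("IPS", regex=False).astype(int)

print(touch.tolist())  # [0, 0, 1]
print(ips.tolist())    # [1, 0, 1]
```

The match is case-sensitive, so a spelling such as "touchscreen" would not be flagged; whether such variants exist in the data is not checked here.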

#### c. Resolution Extraction
- Split the resolution part (e.g., `"1920x1080"`) into two numeric columns:
  - `x_res` → Horizontal pixel count.
  - `y_res` → Vertical pixel count.
- This step extracts the **core quantitative aspect** of screen specification.
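
One possible way to do the split in a single step is a regular expression with two capture groups. This is an alternative sketch, assuming every value contains a `<digits>x<digits>` token; it is not necessarily how the notebook's own cells do it.

```python
import pandas as pd

screen = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "1440x900",
    "IPS Panel Full HD / Touchscreen 1920x1080",
])

# Capture "<digits>x<digits>" wherever it appears in the string.
res = screen.str.extract(r"(\d+)x(\d+)").astype(int)
res.columns = ["x_res", "y_res"]

print(res.values.tolist())  # [[2560, 1600], [1440, 900], [1920, 1080]]
```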

---

### 4. PPI (Pixels Per Inch) Calculation
- Using the formula:  
  \[
  \text{ppi} = \frac{\sqrt{(x\_res)^2 + (y\_res)^2}}{\text{Inches}}
  \]
- Intuition:
  - PPI measures **pixel density**, a key indicator of display sharpness.
  - Higher PPI implies finer detail and better visual clarity.
- After computing, `ppi` replaces raw resolution and inch size, offering a **single continuous metric** that summarizes screen quality.
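
A quick worked example of the formula for a single Full HD panel. The 15.6-inch diagonal is an assumption made for illustration, since the `Inches` column is not displayed in this notebook.

```python
import math

x_res, y_res = 1920, 1080
inches = 15.6  # assumed screen diagonal, for illustration only

diag_px = math.hypot(x_res, y_res)  # sqrt(x_res**2 + y_res**2)
ppi = diag_px / inches

print(round(ppi, 2))  # 141.21
```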

---

### 5. Cleanup
- Drop redundant columns:
  - `ScreenResolution`, `Inches`, `x_res`, and `y_res`.
- This ensures the dataset remains **compact and non-redundant**, retaining only the derived, analysis-ready features (`Touchscreen`, `IPS`, and `ppi`).

---

### 6. Intuitive Takeaway
This transformation converts `ScreenResolution` from a **mixed descriptive text** into **three clean analytical dimensions**:
1. **Touchscreen** → Interactivity (binary feature)  
2. **IPS** → Display type (binary feature)  
3. **PPI** → Display quality (continuous feature)  

Together, they translate textual complexity into **quantifiable, interpretable signals** that enhance both statistical and predictive modeling workflows.


In [51]:
df['Touchscreen']=df['ScreenResolution'].apply(lambda x:1 if 'Touchscreen' in x else 0)


In [52]:
df['Touchscreen']

0       0
1       0
2       0
3       0
4       0
       ..
1298    1
1299    1
1300    0
1301    0
1302    0
Name: Touchscreen, Length: 1303, dtype: int64

In [53]:
df['IPS']=df['ScreenResolution'].apply(lambda x:1 if 'IPS' in x else 0)


In [54]:
df['IPS']

0       1
1       0
2       0
3       1
4       1
       ..
1298    1
1299    1
1300    0
1301    0
1302    0
Name: IPS, Length: 1303, dtype: int64

In [55]:
df['Cpu']

0                       Intel Core i5 2.3GHz
1                       Intel Core i5 1.8GHz
2                 Intel Core i5 7200U 2.5GHz
3                       Intel Core i7 2.7GHz
4                       Intel Core i5 3.1GHz
                        ...                 
1298              Intel Core i7 6500U 2.5GHz
1299              Intel Core i7 6500U 2.5GHz
1300    Intel Celeron Dual Core N3050 1.6GHz
1301              Intel Core i7 6500U 2.5GHz
1302    Intel Celeron Dual Core N3050 1.6GHz
Name: Cpu, Length: 1303, dtype: object

In [56]:
df['Cpu Name'] = df['Cpu'].apply(lambda x:" ".join(x.split()[0:3]))


In [57]:
df['ScreenResolution']

0               IPS Panel Retina Display 2560x1600
1                                         1440x900
2                                Full HD 1920x1080
3               IPS Panel Retina Display 2880x1800
4               IPS Panel Retina Display 2560x1600
                           ...                    
1298     IPS Panel Full HD / Touchscreen 1920x1080
1299    IPS Panel Quad HD+ / Touchscreen 3200x1800
1300                                      1366x768
1301                                      1366x768
1302                                      1366x768
Name: ScreenResolution, Length: 1303, dtype: object

In [58]:
new=df['ScreenResolution'].str.split('x',n=1,expand = True)

df['x_res'] = new[0]
df['y_res'] = new[1]


In [59]:
df['x_res']


0               IPS Panel Retina Display 2560
1                                        1440
2                                Full HD 1920
3               IPS Panel Retina Display 2880
4               IPS Panel Retina Display 2560
                        ...                  
1298     IPS Panel Full HD / Touchscreen 1920
1299    IPS Panel Quad HD+ / Touchscreen 3200
1300                                     1366
1301                                     1366
1302                                     1366
Name: x_res, Length: 1303, dtype: object

In [60]:
df['y_res']

0       1600
1        900
2       1080
3       1800
4       1600
        ... 
1298    1080
1299    1800
1300     768
1301     768
1302     768
Name: y_res, Length: 1303, dtype: object

# Screen Resolution Feature Extraction and PPI Calculation

## Dataset Overview
The `ScreenResolution` column contains the display resolution of laptops. Example entries:

- IPS Panel Retina Display 2560x1600  
- 1440x900  
- Full HD 1920x1080  
- IPS Panel Retina Display 2880x1800  

These values include both textual descriptions and numeric resolution values.

---

## Step 1: Split Resolution into X and Y

**Intuition:**  
To calculate pixel density (PPI), we first need the horizontal (`x_res`) and vertical (`y_res`) resolution. The resolution is separated by an "x".  

## Step 2: Extract Numeric Values
**Intuition:**  
Some `x_res` values contain extra text like "IPS Panel Retina Display" or "Full HD". Using pattern matching (`\d+\.?\d+`, which matches a run of at least two digits), we keep only the first numeric match.  

After this step, `x_res` and `y_res` hold digit-only strings for the horizontal and vertical pixel counts; they are converted to integers in the next step.
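
The sketch below shows how the pattern `\d+\.?\d+` used in the extraction cell behaves compared with a plain `\d+`. The `"4K Ultra HD 3840"` string is a hypothetical input chosen to show the difference; we have not confirmed that it occurs in the dataset.

```python
import re

pattern = r"\d+\.?\d+"  # needs at least two digits to match

print(re.findall(pattern, "IPS Panel Retina Display 2560"))  # ['2560']
print(re.findall(pattern, "4K Ultra HD 3840"))               # ['3840']

# A plain \d+ would also pick up the lone "4" in "4K".
print(re.findall(r"\d+", "4K Ultra HD 3840"))                # ['4', '3840']
```

Because the cell keeps the first match, requiring two or more digits avoids mistaking a label such as "4K" for the pixel count.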

---

## Step 3: Convert to Integer
**Intuition:**  
To perform mathematical operations, both `x_res` and `y_res` are converted from strings to integers.  

---

## Step 4: Calculate PPI (Pixels Per Inch)
**Intuition:**  
PPI measures display sharpness. Formula:

\[
\text{PPI} = \frac{\sqrt{(x\_res)^2 + (y\_res)^2}}{\text{Inches}}
\]

- Uses the Pythagorean theorem to compute the screen diagonal in pixels.  
- Divides by screen size (`Inches`) to normalize.  
- Result is stored in a new column `ppi`.


---

## Summary
1. Split resolution into horizontal and vertical components.  
2. Clean numeric values from text.  
3. Convert to integer for calculations.  
4. Compute PPI to quantify screen sharpness, a key feature for display analysis.


In [61]:
# Keep the first run of two or more digits, dropping labels such as "Full HD".
df['x_res'] = df['x_res'].str.replace(',','').str.findall(r'(\d+\.?\d+)').apply(lambda x:x[0])

In [62]:

df['x_res'] = df['x_res'].astype(int)
df['y_res'] = df['y_res'].astype(int)


In [63]:
df['x_res']

0       2560
1       1440
2       1920
3       2880
4       2560
        ... 
1298    1920
1299    3200
1300    1366
1301    1366
1302    1366
Name: x_res, Length: 1303, dtype: int32

In [64]:
df['y_res']

0       1600
1        900
2       1080
3       1800
4       1600
        ... 
1298    1080
1299    1800
1300     768
1301     768
1302     768
Name: y_res, Length: 1303, dtype: int32

In [65]:
df['ppi'] = (((df['x_res']**2) + (df['y_res']**2))**0.5/df['Inches']).astype(float)

In [66]:
df['ppi']

0       226.983005
1       127.677940
2       141.211998
3       220.534624
4       226.983005
           ...    
1298    157.350512
1299    276.053530
1300    111.935204
1301    100.454670
1302    100.454670
Name: ppi, Length: 1303, dtype: float64

In [None]:
df.drop(columns=['ScreenResolution','Inches','x_res','y_res'],inplace=True)

In [68]:
# index=False avoids writing the row index as an extra unnamed column.
df.to_csv("../data_source/laptop_data_preprocessed.csv", index=False)