# Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a critical step in understanding the characteristics of your dataset. It involves analyzing and visualizing the data to gain insights into its structure, relationships, and patterns.


# Importing Libraries

To perform linear regression, we need to import the following libraries:

- **NumPy**: NumPy is a powerful library for numerical computing in Python. It provides support for arrays, matrices, and mathematical functions.
- **Pandas**: Pandas is a data manipulation library that provides data structures like DataFrame, which is particularly useful for handling structured data.
- **Matplotlib**: Matplotlib is a plotting library that allows us to create various types of plots, such as line plots, scatter plots, and histograms.
- **Seaborn**: Seaborn is a statistical data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- **Scikit-learn**: Scikit-learn is a machine learning library that provides various tools for data mining and data analysis. It includes support for various machine learning algorithms, including linear regression.

``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
```


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

In [2]:
data_path = './data/smartwatches.csv'

In [3]:
df = pd.read_csv(data_path)

In [4]:
df.drop_duplicates(inplace=True)

In [5]:
df.head()

| | Unnamed: 0 | Brand | Current Price | Original Price | Discount Percentage | Rating | Number OF Ratings | Model Name | Dial Shape | Strap Color | Strap Material | Touchscreen | Battery Life (Days) | Bluetooth | Display Size | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | noise | 82990.0 | 89900.0 | 7.686318 | 4.0 | 65.0 | Wrb-sw-colorfitpro4alpha-std-rgld_pnk | NaN | NaN | NaN | NaN | 8.0 | Yes | NaN | 35 - 50 g |
| 1 | 1 | fire-boltt | 3799.0 | 16999.0 | 77.651627 | 4.3 | 20788.0 | BSW046 | NaN | NaN | Silicon | Yes | 3.5 | Yes | 1.8 inches | 50 - 75 g |
| 2 | 2 | boat | 1999.0 | 7990.0 | 74.981227 | 3.8 | 21724.0 | Wave Call | NaN | NaN | Silicon | Yes | 8.0 | Yes | 1.7 inches | 35 - 50 g |
| 3 | 3 | fire-boltt | 1799.0 | 19999.0 | 91.00455 | 4.3 | 13244.0 | BSW053 | NaN | NaN | Silicon | Yes | 3.5 | Yes | 1.8 inches | 75g + |
| 4 | 4 | noise | 1599.0 | 4999.0 | 68.013603 | 4.1 | 13901.0 | Wrb-sw-colorfitpulsegobuzz-std-blk_blk | NaN | NaN | Other | Yes | 8.0 | Yes | 1.7 inches | 35 - 50 g |


In [6]:
df.shape

(450, 16)

# Splitting Data into Training and Testing Sets

Before building our linear regression model, it's essential to split our dataset into training and testing sets. This allows us to train the model on one portion of the data and evaluate its performance on another unseen portion. The common practice is to use a larger portion of the data for training and a smaller portion for testing.

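We can use the `train_test_split` function from the Scikit-learn library to accomplish this. This function randomly splits the dataset into training and testing sets based on a specified ratio. The cells below achieve the same 80/20 split directly in Pandas with `DataFrame.sample` and `DataFrame.drop`.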
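As a minimal sketch of how `train_test_split` behaves, the snippet below splits a tiny made-up DataFrame 80/20. The column names and values are illustrative assumptions, not rows from the smartwatch dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset (values are made up)
toy_df = pd.DataFrame({
    "price": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "rating": [3.9, 4.1, 4.3, 3.8, 4.0, 4.4, 4.2, 3.7, 4.5, 4.1],
})

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=42)

print(toy_train.shape, toy_test.shape)
```

Every row ends up in exactly one of the two frames, which is the property the manual `sample`/`drop` approach also has to preserve.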

``` python
from sklearn.model_selection import train_test_split

# Splitting the dataset into features (X) and target variable (y)

# Splitting the data into 80% training and 20% testing
```

- Train data: (360, 16)
- Test data: (90, 16)


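In [7]:
# Sample 80% of the rows for training; random_state makes the split reproducible
train_df = df.sample(frac=0.8, random_state=42)

In [8]:
# Build the test set from the remaining rows before resetting the index,
# while train_df.index still holds the original row labels of df
test_df = df.drop(train_df.index)

In [9]:
# reset_index() keeps the old row labels in a new 'index' column
train_df = train_df.reset_index()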

In [10]:
# train_df = test_df.reset_index()

In [11]:
print(train_df.shape , test_df.shape)

(360, 17) (90, 16)


In [12]:
train_df.tail()

| | index | Unnamed: 0 | Brand | Current Price | Original Price | Discount Percentage | Rating | Number OF Ratings | Model Name | Dial Shape | Strap Color | Strap Material | Touchscreen | Battery Life (Days) | Bluetooth | Display Size | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 355 | 216 | 216 | fitbit | 11699.0 | 14999.0 | 22.001467 | 4.3 | 3999.0 | Versa 2 | Square | Black | Silicon | Yes | 3.5 | Yes | 1.3 inches | 20 - 35 g |
| 356 | 279 | 279 | garmin | 39490.0 | 44990.0 | 12.224939 | 4.7 | 109.0 | Instinct 2, Rugged Outdoor Watch with GPS, Bui... | Circle | Grey | Silicon | No | 17.5 | Yes | 0.9 inches | NaN |
| 357 | 390 | 390 | zebronics | 2199.0 | 4999.0 | 56.011202 | 3.9 | 272.0 | Zeb-Fit Me | Square | Green | Thermo Plastic Polyurethene | Yes | 22.0 | Yes | 3.3 inches | 20 - 35 g |
| 358 | 337 | 337 | gizmore | 1199.0 | 4499.0 | 73.349633 | 4.7 | NaN | GizFit CLOUD 1.85 IPS Large Display \| AI Voice... | Square | Blue | Silicon | Yes | 22.0 | Yes | 1.8 inches | 20 - 35 g |
| 359 | 236 | 236 | fitbit | 20499.0 | NaN | NaN | 4.7 | NaN | Fitbit Versa 4 Fitness Watch (Waterfall Blue /... | Curved | Blue | Rubber | Yes | 22.0 | Yes | 0.2 inches | NaN |


In [13]:
train_df.dtypes

index                    int64
Unnamed: 0               int64
Brand                   object
Current Price          float64
Original Price         float64
Discount Percentage    float64
Rating                 float64
Number OF Ratings      float64
Model Name              object
Dial Shape              object
Strap Color             object
Strap Material          object
Touchscreen             object
Battery Life (Days)    float64
Bluetooth               object
Display Size            object
Weight                  object
dtype: object

In [14]:
train_df.isna().sum()

index                    0
Unnamed: 0               0
Brand                    0
Current Price            6
Original Price          56
Discount Percentage     56
Rating                   4
Number OF Ratings       45
Model Name              30
Dial Shape             100
Strap Color            100
Strap Material          56
Touchscreen             31
Battery Life (Days)     30
Bluetooth                5
Display Size            27
Weight                 149
dtype: int64

In [15]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                360 non-null    int64  
 1   Unnamed: 0           360 non-null    int64  
 2   Brand                360 non-null    object 
 3   Current Price        354 non-null    float64
 4   Original Price       304 non-null    float64
 5   Discount Percentage  304 non-null    float64
 6   Rating               356 non-null    float64
 7   Number OF Ratings    315 non-null    float64
 8   Model Name           330 non-null    object 
 9   Dial Shape           260 non-null    object 
 10  Strap Color          260 non-null    object 
 11  Strap Material       304 non-null    object 
 12  Touchscreen          329 non-null    object 
 13  Battery Life (Days)  330 non-null    float64
 14  Bluetooth            355 non-null    object 
 15  Display Size         333 non-null    object 
 16  Weight               211 non-null    object 
dtypes: float64(6), int64(2), object(9)

# Identifying Numerical and Categorical Data

Before proceeding with linear regression, it's crucial to understand the types of features in our dataset. Features can be broadly categorized into numerical (continuous) and categorical (discrete) data.

**Numerical Data**: Numerical features represent continuous data that can take on any value within a range. These features are typically quantitative and can be measured on a continuous scale. Examples include age, height, weight, and income.

['index',
 'Unnamed: 0',
 'Current Price',
 'Original Price',
 'Discount Percentage',
 'Rating',
 'Number OF Ratings',
 'Battery Life (Days)']

**Categorical Data**: Categorical features represent discrete data that fall into specific categories or groups. These features are typically qualitative and can take on a limited number of distinct values. Examples include gender, marital status, and occupation.

['Brand',
 'Model Name',
 'Dial Shape',
 'Strap Color',
 'Strap Material',
 'Touchscreen',
 'Bluetooth',
 'Display Size',
 'Weight']

We can use the Pandas library to identify numerical and categorical features in our dataset.
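One way to do this, shown here as a minimal sketch on a made-up three-column frame, is `DataFrame.select_dtypes`. The toy values are assumptions chosen only to mirror the kinds of columns in the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and two text columns (values are made up)
toy_df = pd.DataFrame({
    "Brand": ["noise", "boat", "fitbit"],
    "Rating": [4.0, 3.8, 4.3],
    "Display Size": ["1.8 inches", "1.7 inches", None],
})

# Numeric dtypes (ints and floats) count as numerical features
numerical_cols = toy_df.select_dtypes(include=np.number).columns.tolist()

# Everything else is treated as categorical here
categorical_cols = toy_df.select_dtypes(exclude=np.number).columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Note that a column such as `Display Size` is classified as categorical only because it is stored as text; once the unit is stripped it becomes a numerical feature.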



In [16]:
# Numerical features: every column whose dtype is not 'object'
numerical_data = [feature for feature in train_df.columns if train_df[feature].dtype != 'object']

In [17]:
# Categorical features: every column stored with the 'object' dtype
categorical_data = [feature for feature in train_df.columns if train_df[feature].dtype == 'object']

In [18]:
numerical_data

['index',
 'Unnamed: 0',
 'Current Price',
 'Original Price',
 'Discount Percentage',
 'Rating',
 'Number OF Ratings',
 'Battery Life (Days)']

In [19]:
categorical_data


['Brand',
 'Model Name',
 'Dial Shape',
 'Strap Color',
 'Strap Material',
 'Touchscreen',
 'Bluetooth',
 'Display Size',
 'Weight']

In [20]:
train_df[numerical_data].describe()

| | index | Unnamed: 0 | Current Price | Original Price | Discount Percentage | Rating | Number OF Ratings | Battery Life (Days) |
|---|---|---|---|---|---|---|---|---|
| count | 360.0 | 360.0 | 354.0 | 304.0 | 304.0 | 356.0 | 315.0 | 330.0 |
| mean | 222.702778 | 222.702778 | 12514.378531 | 14415.835526 | 47.955347 | 4.027528 | 10671.815873 | 14.18197 |
| std | 130.632905 | 130.632905 | 16914.978666 | 15613.457236 | 24.696899 | 0.556828 | 27575.956948 | 7.661878 |
| min | 0.0 | 0.0 | 1199.0 | 1669.0 | -79.688436 | 1.0 | 1.0 | 0.75 |
| 25% | 111.75 | 111.75 | 2126.0 | 5999.0 | 33.177427 | 3.9 | 55.0 | 8.0 |
| 50% | 222.5 | 222.5 | 3999.0 | 7994.5 | 53.068408 | 4.1 | 830.0 | 17.5 |
| 75% | 336.25 | 336.25 | 17367.25 | 17996.0 | 66.67778 | 4.3 | 7576.5 | 22.0 |
| max | 448.0 | 448.0 | 98990.0 | 96390.0 | 91.00455 | 5.0 | 275607.0 | 22.0 |


In [21]:
# checking correlation of the data

train_df[numerical_data].corr()

| | index | Unnamed: 0 | Current Price | Original Price | Discount Percentage | Rating | Number OF Ratings | Battery Life (Days) |
|---|---|---|---|---|---|---|---|---|
| index | 1.0 | 1.0 | -0.100146 | -0.084262 | -0.123792 | -0.247504 | -0.187509 | 0.371223 |
| Unnamed: 0 | 1.0 | 1.0 | -0.100146 | -0.084262 | -0.123792 | -0.247504 | -0.187509 | 0.371223 |
| Current Price | -0.100146 | -0.100146 | 1.0 | 0.971497 | -0.602668 | 0.397179 | -0.195773 | 0.030352 |
| Original Price | -0.084262 | -0.084262 | 0.971497 | 1.0 | -0.479494 | 0.325818 | -0.18065 | -0.132527 |
| Discount Percentage | -0.123792 | -0.123792 | -0.602668 | -0.479494 | 1.0 | -0.191073 | 0.235221 | -0.011792 |
| Rating | -0.247504 | -0.247504 | 0.397179 | 0.325818 | -0.191073 | 1.0 | 0.08525 | -0.102289 |
| Number OF Ratings | -0.187509 | -0.187509 | -0.195773 | -0.18065 | 0.235221 | 0.08525 | 1.0 | 0.009617 |
| Battery Life (Days) | 0.371223 | 0.371223 | 0.030352 | -0.132527 | -0.011792 | -0.102289 | 0.009617 | 1.0 |


In [22]:
# skewness of the data
train_df[numerical_data].skew()

index                  0.012368
Unnamed: 0             0.012368
Current Price          2.301689
Original Price         2.734265
Discount Percentage   -0.983632
Rating                -1.589043
Number OF Ratings      5.467630
Battery Life (Days)   -0.190298
dtype: float64

# Data Profiling

Data profiling is a crucial step in the data analysis process that involves examining and summarizing the characteristics of a dataset. It provides valuable insights into the data distribution, missing values, correlations, and other patterns, helping to understand the underlying structure and quality of the dataset.
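Before reaching for a full profiling report, a few of these characteristics can be summarized with plain Pandas. The following is a rough sketch on a made-up frame; the `summary` layout and the toy values are assumptions for illustration, not output from the smartwatch data.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (values are made up)
toy_df = pd.DataFrame({
    "Brand": ["noise", "boat", "fitbit", "noise"],
    "Rating": [4.0, np.nan, 4.3, 4.1],
    "Weight": ["35 - 50 g", None, None, "75g +"],
})

# One row per column: dtype, missing count, missing share, distinct values
summary = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "missing": toy_df.isna().sum(),
    "missing_pct": (toy_df.isna().mean() * 100).round(1),
    "unique": toy_df.nunique(),  # NaN is not counted as a distinct value
})

print(summary)
```

A table like this is a quick way to spot columns that need imputation or are too sparse to keep.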

## Using Pandas Profiling

We will utilize the `pandas_profiling` library, now distributed as `ydata_profiling`, to generate a comprehensive report on our dataset. This report will include various statistics, visualizations, and insights to facilitate our understanding of the data.

```python
from ydata_profiling import ProfileReport

# Perform Pandas Profiling
profile = ProfileReport(train_df)

# Save the report to a file
profile.to_file("data_profile_report.html")
```


In [23]:
!pip install h5py
!pip install typing-extensions
!pip install wheel



In [None]:
!pip install ydata-profiling
!pip install ipywidgets



In [None]:
# pandas_profiling was renamed to ydata_profiling
from ydata_profiling import ProfileReport

In [None]:
# Build the report object without shadowing the imported name
profile = ProfileReport(train_df)

In [None]:
profile.to_file("data_profile_report.html")

# Data Visualization: Univariate and Multivariate Analysis

Data visualization is an essential tool for understanding the distribution and relationships within a dataset. In this section, we'll explore both univariate (analysis of single variables) and multivariate (analysis of relationships between multiple variables) visualization techniques.

## Univariate plots
- Histograms
- Density plots
- Box and whisker plots

## Multivariate plots
- Correlation matrix plot
- Scatter plot matrix
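The difference between the two families of plots can be sketched with plain Matplotlib on synthetic data. The column names, distributions, and the headless `Agg` backend below are assumptions made so the example runs anywhere; the notebook itself uses Seaborn on the real training data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: a right-skewed "price" and a roughly normal "rating"
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "price": rng.lognormal(mean=8, sigma=1, size=200),
    "rating": rng.normal(loc=4, scale=0.4, size=200),
})

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

# Univariate: distribution of a single variable
axes[0].hist(toy_df["price"], bins=30)
axes[0].set_title("Histogram of price")
axes[1].boxplot(toy_df["rating"])
axes[1].set_title("Box plot of rating")

# Multivariate: relationship between two variables
axes[2].scatter(toy_df["price"], toy_df["rating"], s=8)
axes[2].set_title("price vs rating")

fig.tight_layout()
```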

  


In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
numerical_data.remove('index')

In [None]:
numerical_data.remove('Unnamed: 0')

In [None]:
numerical_data

In [None]:
fig , axes = plt.subplots(nrows=2 , ncols=3 , figsize=(14,10))

# One histogram (with a KDE overlay) per numerical feature on a 2x3 grid
for i , coll in enumerate(numerical_data):
    row = i // 3
    col = i % 3
    ax = axes[row , col]
    sns.histplot(data=train_df[coll],ax=ax,kde=True)

plt.show()

In [None]:
fig , axes = plt.subplots(nrows=2 , ncols=3 , figsize=(14,10))

for i , coll in enumerate(numerical_data):
    row = i // 3
    col = i % 3
    ax = axes[row , col]
    sns.boxplot(data=train_df[coll],ax=ax)
fig.tight_layout()
plt.show()

In [None]:
sns.pairplot(train_df)
plt.show()

In [None]:
# Restrict to numerical columns: corr() cannot handle the text columns
sns.heatmap(train_df[numerical_data].corr(), annot=True)
plt.show()

# Feature Engineering

Feature engineering is a crucial step in the machine learning pipeline that involves creating new features or transforming existing ones to enhance the performance of predictive models. In this section, we'll explore various techniques for feature engineering.

## Handling Missing Values

Dealing with missing values is essential as many machine learning algorithms cannot handle them. There are several strategies for handling missing values:

- **Imputation**: Replace missing values with a suitable statistic such as mean, median, or mode.
- **Deletion**: Remove rows or columns with missing values when only a few rows are affected, or when a column is missing so many values that it carries little information.
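Both strategies can be sketched on a small made-up frame. The 50% sparsity threshold and the toy values are assumptions chosen for illustration; a suitable cut-off depends on the dataset.

```python
import numpy as np
import pandas as pd

# Toy training frame with missing values (values are made up)
toy_train = pd.DataFrame({
    "Rating": [4.0, np.nan, 4.4, 3.8],
    "Strap Material": ["Silicon", None, "Rubber", "Silicon"],
    "Weight": [np.nan, np.nan, np.nan, 27.5],
})

# Deletion: drop columns where more than half of the values are missing
# (the 0.5 threshold is an illustrative assumption)
too_sparse = toy_train.columns[toy_train.isna().mean() > 0.5]
cleaned = toy_train.drop(columns=too_sparse)

# Imputation: median for the numeric column, mode for the categorical one
rating_median = cleaned["Rating"].median()
material_mode = cleaned["Strap Material"].mode()[0]
cleaned = cleaned.fillna({"Rating": rating_median, "Strap Material": material_mode})

print(cleaned)
```

Computing the median and mode on the training data only, and reusing those values for the test data, keeps information from the test set out of the preprocessing step.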




In [None]:
# lib importing
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
train_df.head()


In [None]:
df.shape , train_df.shape , test_df.shape

In [None]:
# removing the 'index' column added by reset_index() and the leftover 'Unnamed: 0' column
train_df.drop(['index' , 'Unnamed: 0'], axis = 1 , inplace=True)

In [None]:
#checking null values
train_df['Display Size'].isna().sum()

In [None]:
# Assign the result back; chained inplace fillna is deprecated in recent pandas
train_df['Display Size'] = train_df['Display Size'].fillna('0.0 inches')

In [None]:
train_df['Display Size'].isna().sum()

In [None]:
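# Keep only the number in front of 'inches' and store it back as a float column
train_df['Display Size'] = train_df['Display Size'].apply(lambda x : float(x.split()[0]))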

In [None]:
# for weight data
train_df['Weight'].isna().sum()


In [None]:
train_df['Weight'].value_counts()

In [None]:
re.findall(r'\d+' , '20 - 35 g')

In [None]:
# Replace each weight range with its midpoint, e.g. '20 - 35 g' -> 27.5
cal = sum(int(x) for x in re.findall(r'\d+' , '20 - 35 g')) / 2
train_df['Weight'] = train_df['Weight'].replace('20 - 35 g', cal)

In [None]:
cal = sum(int(x) for x in re.findall(r'\d+' , '35 - 50 g')) / 2
train_df['Weight'] = train_df['Weight'].replace('35 - 50 g', cal)

In [None]:
cal = sum(int(x) for x in re.findall(r'\d+' , '50 - 75 g')) / 2
train_df['Weight'] = train_df['Weight'].replace('50 - 75 g', cal)

In [None]:
# Open-ended bins keep their single bound: '75g +' -> 75.0
train_df['Weight'] = train_df['Weight'].replace('75g +' , float(re.findall(r'\d+' , '75g')[0]))

In [None]:
# '<= 20 g' -> 20.0
train_df['Weight'] = train_df['Weight'].replace('<= 20 g' , float(re.findall(r'\d+' , '<= 20 g')[0]))

In [None]:
train_df['Weight'].value_counts()

In [None]:
train_df.info()

In [None]:
train_df.head()

In [None]:
train_df['Discount Price'] = (train_df['Original Price'] * (-train_df['Discount Percentage']))/100

In [None]:
train_df.drop(['Discount Percentage'], axis = 1 , inplace = True)

In [None]:
train_df.shape

## Numerical Data

In [None]:
numerical_col = [feature for feature in train_df.columns if train_df[feature].dtype == np.float64]


In [None]:
numerical_col

In [None]:
train_df[numerical_col].head()

In [None]:
fig , axes = plt.subplots(nrows=3 , ncols=3 , figsize=(14,10))

for i , coll in enumerate(numerical_col):
    row = i // 3
    col = i % 3
    ax = axes[row , col]
    sns.kdeplot(data=train_df[coll],ax=ax , fill=True)
fig.tight_layout()
plt.show()

In [None]:
fig , axes = plt.subplots(nrows=3 , ncols=3 , figsize=(14,10))

for i , coll in enumerate(numerical_col):
    row = i // 3
    col = i % 3
    ax = axes[row , col]
    sns.boxplot(x=train_df[coll],ax=ax )
fig.tight_layout()
plt.show()

In [None]:
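def remove_outliers_IQR(data, col):
    """Drop rows whose value in `col` lies outside 1.5 * IQR of the quartiles."""
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # NaN fails both comparisons, so rows with a missing value in `col` are dropped too
    return data[(data[col] > lower_bound) & (data[col] < upper_bound)]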

In [None]:
train_df.columns

In [None]:
train_df.shape

In [None]:
import_col = [ 'Current Price', 'Original Price', 'Rating', 'Number OF Ratings', 'Display Size' ]

In [None]:
for col in import_col:
    train_df = remove_outliers_IQR(train_df , col)

In [None]:
for col in numerical_col:
    print(col)
    # Assign back instead of using chained inplace fillna, which recent pandas warns about
    train_df[col] = train_df[col].fillna(train_df[col].median())

In [None]:
train_df[numerical_col].isna().sum()

In [None]:
train_df.shape

In [None]:
train_df.isna().sum()

In [None]:
# sklearn
!pip install scikit-learn
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
data = scaler.fit_transform(train_df[numerical_col[: -1]])

In [None]:
data

In [None]:
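# Reuse train_df's index so the later pd.concat(axis=1) lines up row by row;
# outlier removal left gaps in the index, and a fresh RangeIndex would misalign
data = pd.DataFrame(data, columns=numerical_col[:-1], index=train_df.index)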

In [None]:
data.head()

In [None]:
data.shape , train_df.shape

In [None]:
train_df.drop(numerical_col[:-1], axis=1,inplace=True)

In [None]:
train_df.head()

In [None]:
train_df = pd.concat([train_df,data], axis=1)

In [None]:
train_df.head()

In [None]:
train_df[numerical_col]

## Categorical Data

In [None]:
# data_path = './data/smartwatches.csv'
# df = pd.read_csv(data_path)
# df.drop_duplicates(inplace=True)

In [None]:
# train_df = df.copy()

In [None]:
# df.shape , train_df.shape

In [None]:
train_df.shape

In [None]:
categorical_col = [feature for feature in train_df.columns if train_df[feature].dtype == 'object' ]

In [None]:
categorical_col

In [None]:
train_df [categorical_col].head()


In [None]:
train_df['Bluetooth'].value_counts()


In [None]:
categorical_col.remove('Bluetooth')


In [None]:
from scipy.stats import f_oneway


In [None]:

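# One-way ANOVA per categorical column: does the mean 'Discount Price' differ between categories?
for col in categorical_col:
    category_group_lists = train_df.groupby(col)['Discount Price'].apply(list)
    anova_results = f_oneway(*category_group_lists)
    print(col, ': ', 'P-Value for anova is : ', anova_results.pvalue)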


In [None]:
imp_col = ['Brand', 'Model Name', 'Dial Shape', 'Strap Material']

In [None]:
train_df[imp_col].head()


In [None]:
train_df[imp_col].isna().sum()


In [None]:
for col in imp_col[1:]:
    # Assign back instead of using chained inplace fillna
    train_df[col] = train_df[col].fillna('other')

In [None]:
brand = pd.get_dummies(train_df['Brand'], drop_first=True)


In [None]:
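# Drop the 'other' dummy as the reference level; errors='ignore' avoids a
# KeyError when a column had no missing values and so has no 'other' category
model_name = pd.get_dummies(train_df['Model Name']).drop(columns=['other'], errors='ignore')
dial_shape = pd.get_dummies(train_df['Dial Shape']).drop(columns=['other'], errors='ignore')
strap_material = pd.get_dummies(train_df['Strap Material']).drop(columns=['other'], errors='ignore')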

In [None]:
imp_df = pd.concat([brand, model_name, dial_shape, strap_material], axis=1)


In [None]:
imp_df.head()


In [None]:
train_df[numerical_col].isna().sum()


In [None]:
new_df = pd.concat([train_df[numerical_col], imp_df], axis=1)


In [None]:
new_df.head()


In [None]:
new_df.isna().sum()


In [None]:
new_df.to_csv('./data/clean.csv', index=False)

# Model Building

Model building is the process of training a machine learning model on the training data and evaluating its performance on the testing data. 
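The train-then-evaluate loop can be sketched on synthetic data as follows. The features, coefficients, and noise level are made-up assumptions, so the score below says nothing about the smartwatch data; it only illustrates the workflow with `LinearRegression` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: y depends linearly on two features plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=200)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training portion only, then score on the unseen portion
model = LinearRegression()
model.fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))

print(f"R^2 on held-out data: {score:.3f}")
```

An R² close to 1 means the model explains most of the variance in the held-out target; values near 0 or below indicate the model does no better than predicting the mean.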




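In [None]:
# Assumption: the modelling cells work on the cleaned data saved above
df = pd.read_csv('./data/clean.csv')
df.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# 'Discount Percentage' was dropped during feature engineering;
# the derived 'Discount Price' column is the remaining target
X = df.drop('Discount Price', axis=1)
y = df['Discount Price']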

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.2 , random_state = 42)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lr_model = LinearRegression()
lr_model.fit(X_train , y_train)
y_pred = lr_model.predict(X_test)

res = r2_score(y_test , y_pred)
print(res)


In [None]:
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train , y_train)
y_pred = dt_model.predict(X_test)
res = r2_score(y_test , y_pred)
print(res)

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(X_train , y_train)
y_pred = rf_model.predict(X_test)
res = r2_score(y_test , y_pred)
print(res)

In [None]:
import xgboost as xgb

model = xgb.XGBRegressor()
model.fit(X_train , y_train)
y_pred = model.predict(X_test)
res = r2_score(y_test , y_pred)
print(res)


In [None]:
# cross validation

In [None]:
from sklearn.model_selection import cross_val