# Predicting Product Item

1. Data Collection
2. Define Problem Statement
3. EDA
4. Data Preparation
5. Selecting and Training ML models
6. Hyperparameter Tuning
7. Deploy the Model using a web service

### Step 1: Data Collection

In [87]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [88]:
# reading the .csv file using pandas
# columns: ["Item ID", "Branch", "Customer type", "Gender", "Age", "Unit price", "Rating"]

df = pd.read_csv('data.csv', skipinitialspace=True)

data = df.copy()

In [89]:
data.sample(10)

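| | Item ID | Branch | Customer type | Gender | Age | Unit price | Rating |
|---|---|---|---|---|---|---|---|
| 603 | 5 | B | 0 | 1 | 2 | 45 | 7.8 |
| 987 | 6 | B | 1 | 0 | 1 | 10 | 6.2 |
| 784 | 5 | C | 1 | 1 | 0 | 45 | 4.1 |
| 202 | 2 | C | 0 | 0 | 3 | 20 | 9.8 |
| 409 | 3 | C | 0 | 1 | 1 | 35 | 5.4 |
| 734 | 5 | B | 1 | 0 | 0 | 45 | 6.4 |
| 802 | 5 | C | 1 | 0 | 1 | 45 | 6.9 |
| 633 | 5 | B | 0 | 0 | 2 | 45 | 4.7 |
| 5 | 1 | C | 0 | 0 | 1 | 25 | 4.1 |
| 646 | 5 | C | 0 | 0 | 1 | 45 | 7.4 |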


### Step 2: Define Problem Statement

Our aim is to **predict the Product Item** (the `Item ID` column) from the other attributes of each record: `Branch`, `Customer type`, `Gender`, `Age`, `Unit price`, and `Rating`.
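
A minimal sketch of what this means for the data layout, assuming `Item ID` is the target (it is the column the correlations are computed against later) and every other column is a feature. The four rows are copied from the sample output above; the names `toy`, `X`, and `y` are illustrative and not part of the notebook.

```python
import pandas as pd

# Four rows copied from the data.sample(10) output above
toy = pd.DataFrame({
    "Item ID": [5, 6, 2, 3],
    "Branch": ["B", "B", "C", "C"],
    "Customer type": [0, 1, 0, 0],
    "Gender": [1, 0, 0, 1],
    "Age": [2, 1, 3, 1],
    "Unit price": [45, 10, 20, 35],
    "Rating": [7.8, 6.2, 9.8, 5.4],
})

# Target: the product item; features: every other attribute
y = toy["Item ID"]
X = toy.drop(columns=["Item ID"])

print(X.columns.tolist())
print(sorted(y.unique()))
```

Since `Item ID` takes a small set of integer values, it may be more natural to treat it as a class label than as a continuous quantity, but that choice is left to the modelling step.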

### Step 3: Exploratory Data Analysis

1. Check the data type of each column
2. Check for null values
3. Check for outliers
4. Look at the category distribution in categorical columns
5. Plot correlations
6. Look for new (derived) variables
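
Item 3 in the list (outliers) can be checked with the 1.5 × IQR rule, for example. The sketch below is one possible approach rather than something the notebook itself runs; the helper name `iqr_outlier_counts` and the toy values (including the planted 500) are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="n_outliers")

# Toy frame: the 500 in "Unit price" is a deliberately planted outlier
toy = pd.DataFrame({
    "Unit price": [10, 20, 35, 45, 45, 55, 500],
    "Rating": [4.0, 5.5, 7.0, 7.4, 8.5, 9.8, 10.0],
})
outliers = iqr_outlier_counts(toy, ["Unit price", "Rating"])
print(outliers)
```

On the notebook's data the same helper could be called as `iqr_outlier_counts(data, ["Unit price", "Rating"])`.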

In [90]:
##checking the data info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 7 columns):
Item ID          999 non-null int64
Branch           999 non-null object
Customer type    999 non-null int64
Gender           999 non-null int64
Age              999 non-null int64
Unit price       999 non-null int64
Rating           999 non-null float64
dtypes: float64(1), int64(5), object(1)
memory usage: 54.7+ KB


In [91]:
##checking for all the null values
data.isnull().sum()

Item ID          0
Branch           0
Customer type    0
Gender           0
Age              0
Unit price       0
Rating           0
dtype: int64

In [92]:
##summary statistics of quantitative variables
data.describe()

| | Item ID | Customer type | Gender | Age | Unit price | Rating |
|---|---|---|---|---|---|---|
| count | 999.0 | 999.0 | 999.0 | 999.0 | 999.0 | 999.0 |
| mean | 3.778779 | 0.500501 | 0.500501 | 1.372372 | 34.454454 | 6.973073 |
| std | 1.45683 | 0.50025 | 0.50025 | 0.888288 | 14.408083 | 1.719401 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 10.0 | 4.0 |
| 25% | 3.0 | 0.0 | 0.0 | 1.0 | 20.0 | 5.5 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 35.0 | 7.0 |
| 75% | 5.0 | 1.0 | 1.0 | 2.0 | 45.0 | 8.5 |
| max | 6.0 | 1.0 | 1.0 | 3.0 | 55.0 | 10.0 |


In [93]:
##category distribution
data["Unit price"].value_counts() / len(data)

45    0.283283
35    0.215215
20    0.189189
55    0.145145
10    0.118118
25    0.049049
Name: Unit price, dtype: float64

In [94]:
data['Age'].value_counts()

1    526
2    196
3    151
0    126
Name: Age, dtype: int64

In [95]:
##pairplots to get an intuition of potential correlations
# sns.pairplot(data[["Item ID", "Branch", "Customer type", "Age", "Unit price"]], diag_kind="kde")


# ==========================================================

### Setting aside Test Set

In [96]:
# set aside the test data
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

test_set.shape

(200, 7)

In [None]:
train_set['Unit price'].value_counts() / len(train_set)

45    0.286608
35    0.210263
20    0.192741
55    0.141427
10    0.116395
25    0.052566
Name: Unit price, dtype: float64

In [None]:
test_set["Unit price"].value_counts() / len(test_set)

### Stratified Sampling

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices, so select rows with iloc
for train_index, test_index in split.split(data, data["Customer type"]):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

In [None]:
strat_test_set.shape

In [None]:
##checking the Customer type category distribution in the training set
strat_train_set['Customer type'].value_counts() / len(strat_train_set)

In [None]:
##checking the Customer type category distribution in the test set
strat_test_set["Customer type"].value_counts() / len(strat_test_set)
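
To see what stratification changes, the category proportions of a random split and a stratified split can be placed next to the overall proportions. The sketch below runs on a synthetic stand-in frame, not on `data.csv`: the imbalanced 70/30 column, the helper name `proportions`, and the error columns are assumptions for illustration. It uses the `stratify` argument of `train_test_split`, a shorter alternative to `StratifiedShuffleSplit` when only one split is needed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def proportions(frame, column):
    """Share of rows per category, sorted by category value."""
    return frame[column].value_counts(normalize=True).sort_index()

# Synthetic stand-in for the dataset: 999 rows, imbalanced binary column
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Customer type": rng.choice([0, 1], size=999, p=[0.7, 0.3]),
    "Rating": rng.uniform(4, 10, size=999).round(1),
})

_, rand_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Customer type"]
)

compare = pd.DataFrame({
    "overall": proportions(toy, "Customer type"),
    "random": proportions(rand_test, "Customer type"),
    "stratified": proportions(strat_test, "Customer type"),
})
# Relative deviation from the overall proportions, in percent
compare["random_err_%"] = 100 * (compare["random"] / compare["overall"] - 1)
compare["strat_err_%"] = 100 * (compare["stratified"] / compare["overall"] - 1)
print(compare)
```

The stratified column should stay within rounding distance of the overall proportions, while the random column can drift further depending on the seed.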

In [None]:
##Branch already holds letter labels such as 'B' and 'C' (dtype object), so no integer-to-label mapping is needed
##(mapping from the integer keys 1, 2, 3 would turn every value into NaN)
train_set.sample(10)


In [None]:
##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()

In [None]:
data = strat_train_set.copy()

### Checking correlation matrix w.r.t. Item ID

In [None]:
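##numeric_only=True (pandas 1.5+) leaves out the non-numeric Branch column,
##which would otherwise raise an error from pandas 2.0 onwards
corr_matrix = data.corr(numeric_only=True)
corr_matrix['Item ID'].sort_values(ascending=False)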

### Testing new variables by checking their correlation w.r.t. Item ID

Candidate variables built from the columns available in this dataset (illustrative choices):

1. Unit price on Rating
2. Unit price times Rating
3. Unit price on Age group
4. Rating on Age group

In [None]:
## testing new variables by checking their correlation w.r.t. Item ID
## Age is coded 0-3, so 1 is added before dividing to avoid division by zero
data['price_on_rating'] = data['Unit price'] / data['Rating']
data['price_x_rating'] = data['Unit price'] * data['Rating']
data['price_on_age_group'] = data['Unit price'] / (data['Age'] + 1)
data['rating_on_age_group'] = data['Rating'] / (data['Age'] + 1)

corr_matrix = data.corr(numeric_only=True)
corr_matrix['Item ID'].sort_values(ascending=False)


### Step 4: Data Preparation

1. Handling Categorical Features - OneHotEncoder
2. Data Cleaning - SimpleImputer
3. Attribute Addition - Adding a custom transformation
4. Setting up a Data Transformation Pipeline for numerical and categorical columns
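
The four steps above can be combined into one preprocessing object. The following is a minimal sketch under assumptions, not the notebook's actual pipeline: the column split (`Branch` categorical, the rest numeric), the added attribute (unit price divided by rating), and the helper name `add_price_on_rating` are hypothetical choices for illustration, and the toy rows include one planted missing value.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

num_cols = ["Customer type", "Gender", "Age", "Unit price", "Rating"]
cat_cols = ["Branch"]

def add_price_on_rating(X):
    """Append Unit price / Rating as an extra column (hypothetical attribute).

    X is the imputed numeric array in num_cols order, so Unit price is
    column 3 and Rating is column 4.
    """
    return np.c_[X, X[:, 3] / X[:, 4]]

num_pipeline = Pipeline([
    # data cleaning: fill missing numeric values with the column median
    ("imputer", SimpleImputer(strategy="median")),
    # attribute addition: custom transformation applied after imputing
    ("attrs_adder", FunctionTransformer(add_price_on_rating)),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    # categorical handling: one indicator column per Branch value
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Toy rows in the dataset's layout; the NaN is planted to exercise the imputer
toy = pd.DataFrame({
    "Branch": ["B", "B", "C", "C"],
    "Customer type": [0, 1, 0, 0],
    "Gender": [1, 0, 0, 1],
    "Age": [2, 1, 3, 1],
    "Unit price": [45, 10, np.nan, 35],
    "Rating": [7.8, 6.2, 9.8, 5.4],
})
prepared = full_pipeline.fit_transform(toy)
print(prepared.shape)
```

Fitting the pipeline on the training set and calling only `transform` on the test set keeps the medians and the category list learned from training data alone.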

In [None]:
##handling missing values
from sklearn.impute import SimpleImputer

##the median strategy only works on numeric data, so leave out the Branch column
num_data = data.select_dtypes(include='number')

imputer = SimpleImputer(strategy="median")
imputer.fit(num_data)

In [None]:
imputer.statistics_

In [None]:
num_data.median().values

In [None]:
X = imputer.transform(num_data)

In [None]:
data_tr = pd.DataFrame(X, columns=num_data.columns,
                          index=num_data.index)