# Enigma Day 1 - Classification

The elite 10 Luxury Conglomerates reported USD 144 billion in revenue from luxury goods in FY 2019, and the luxury goods market has been on a perpetual upward trend. Few can afford these luxury products, and even fewer can manage to produce them. Owning these products has always been prestigious, and they are the epitome of desirability.

The Mafia has been selling exact replicas of various high-end products such as premium sneakers, exquisite jackets, Swiss watches, stylish handbags, and haute-couture clothes. The Mafia has continuously earned hefty profits from this operation, causing losses worth billions to these companies. It further uses this money to fund illegal operations all over Asia. You are part of the Asian Federal Trade Control's (AFTC) Bureau of Consumer Protection.

You have been assigned to scrutinize the data in detail and gather intel about the Mafia's operations. The counterfeit products are virtually indistinguishable from the original ones, and the Mafia would do anything in its power to keep selling them.

The given data contains attributes that you need to use to determine which of the products are counterfeit, so that the chain of operation can be tracked down.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('Classification_Data.csv')

In [3]:
data.head()

| | Durability % | Hand/Factory Made | MSRP | Box Volume (cm^3) | Nature of Payment | No. of Sales | Asia or Not | Port | Factory | Counterfeit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.312381 | Handmade | 5196 | 18785.704912 | PrePaid | 10211 | Asia | Port 0 | Factory 0 | 0 |
| 1 | 0.512374 | Handmade | 5318 | 17301.908573 | PrePaid | 5185 | Asia | Port 0 | Factory 2 | 1 |
| 2 | 0.752365 | Factorymade | 6563 | 9250.611338 | COD | 5579 | Asia | Port 2 | Factory 0 | 1 |
| 3 | 0.592371 | Factorymade | 5318 | 12618.315418 | COD | 6036 | Asia | Port 1 | Factory 1 | 0 |
| 4 | 0.712367 | Factorymade | 3553 | 9864.706939 | COD | 6051 | Asia | Port 1 | Factory 0 | 1 |


In [4]:
data.columns

Index(['Durability %', 'Hand/Factory Made', 'MSRP', 'Box Volume (cm^3)',
       'Nature of Payment', 'No. of Sales', 'Asia or Not', 'Port ', 'Factory ',
       'Counterfeit'],
      dtype='object')
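
The `data.columns` output above shows that two column names, `'Port '` and `'Factory '`, carry a trailing space, which is easy to trip over when selecting columns by name. A minimal sketch of normalising such names follows; the `toy` frame and its values are made up for illustration and stand in for the real file.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the trailing spaces in the real column names
toy = pd.DataFrame({
    "Port ": ["Port 0", "Port 2"],
    "Factory ": ["Factory 0", "Factory 1"],
    "MSRP": [5196, 6563],
})

# Strip leading/trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

The notebook itself never refers to these two columns by name, so this cleanup is optional here.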

In [5]:
data.dtypes

Durability %         float64
Hand/Factory Made     object
MSRP                   int64
Box Volume (cm^3)    float64
Nature of Payment     object
No. of Sales           int64
Asia or Not           object
Port                  object
Factory               object
Counterfeit            int64
dtype: object

In [6]:
# checking for missing/ null values
data.isnull().sum()

Durability %         0
Hand/Factory Made    0
MSRP                 0
Box Volume (cm^3)    0
Nature of Payment    0
No. of Sales         0
Asia or Not          0
Port                 0
Factory              0
Counterfeit          0
dtype: int64

In [7]:
# splitting the independent and dependent variables
y = data.Counterfeit
X = data.drop(['Counterfeit'], axis=1)

In [8]:
# splitting the dataset as per the instructions given
X_train_full = X.iloc[:2000]
X_valid_full = X.iloc[2000:]

y_train = y.iloc[:2000]
y_valid = y.iloc[2000:]
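
Because the split above is positional rather than shuffled, the two parts can end up with different shares of counterfeit items. A quick check is sketched here on made-up labels; `toy_y` is a hypothetical stand-in for `data.Counterfeit`, and in the notebook the real `y_train` and `y_valid` would be inspected instead.

```python
import pandas as pd

# Hypothetical 0/1 labels standing in for data.Counterfeit
toy_y = pd.Series([0, 1, 1, 0, 1, 1])

# Positional split, mirroring the first-N / rest pattern used above
toy_train = toy_y.iloc[:4]
toy_valid = toy_y.iloc[4:]

# For 0/1 labels the mean is the share of counterfeit items
train_share = toy_train.mean()
valid_share = toy_valid.mean()

print(train_share, valid_share)
```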

In [9]:
# selecting the categorical columns with relatively low cardinality
categorical_cols = [
    cname for cname in X_train_full.columns
    if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"
]

In [10]:
# selecting the numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

In [11]:
# keeping the selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
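
The same column selection can be sketched with `DataFrame.select_dtypes`, which avoids comparing dtype names as strings. The `toy` frame below is made up for illustration. Note that `exclude="number"` would also pick up non-text columns such as booleans or dates, which this dataset does not appear to contain.

```python
import pandas as pd

# Toy frame (hypothetical values) with two numeric and two text columns
toy = pd.DataFrame({
    "Durability %": [0.31, 0.51, 0.75],
    "Hand/Factory Made": ["Handmade", "Handmade", "Factorymade"],
    "MSRP": [5196, 5318, 6563],
    "Nature of Payment": ["PrePaid", "PrePaid", "COD"],
})

# Numeric columns: any integer or float dtype
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Non-numeric columns, kept only if they have fewer than 10 distinct values
non_numeric = toy.select_dtypes(exclude="number").columns
categorical_cols = [c for c in non_numeric if toy[c].nunique() < 10]

print(numerical_cols)
print(categorical_cols)
```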

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

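# preprocessing for numerical data
# (strategy='constant' fills missing numeric values with 0 by default;
# the isnull() check above found none, so the imputer is a safeguard only)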
numerical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy='constant')),
    ('scaler', RobustScaler())
])

# preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
      ])
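
To see what a preprocessor of this shape produces, here is a small sketch on a made-up two-column frame. The names `toy`, `toy_preprocessor` and `unseen`, and all values, are hypothetical. It shows the numeric column being scaled, the categorical column being expanded into one column per category, and `handle_unknown='ignore'` encoding an unseen category as all zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (hypothetical values): one numeric and one categorical column
toy = pd.DataFrame({
    "MSRP": [5196, 5318, 6563, 3553],
    "Nature of Payment": ["PrePaid", "PrePaid", "COD", "COD"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant")),
        ("scaler", RobustScaler()),
    ]), ["MSRP"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Nature of Payment"]),
])

# One scaled numeric column plus one column per payment category
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)

# A payment type never seen during fit gets zeros in every one-hot column
unseen = pd.DataFrame({"MSRP": [5000], "Nature of Payment": ["Wire"]})
row = toy_preprocessor.transform(unseen)
row = row.toarray() if hasattr(row, "toarray") else row  # handle sparse output
print(row[0, 1:])
```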

In [13]:
from sklearn.ensemble import RandomForestClassifier

# defining the model
rfc = RandomForestClassifier(max_depth=10, random_state=7, n_estimators=93)

In [14]:
# bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('rfc', rfc)
                             ])

# preprocessing of training data, fitting the model 
my_pipeline.fit(X_train, y_train)

# preprocessing of validation data, evaluating the model
score = my_pipeline.score(X_valid, y_valid)

print('Score: ', score)

Score:  0.7310344827586207


Final Score: 73.10% accuracy on the validation set
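
For a classifier, `Pipeline.score` reports mean accuracy, which does not show whether the errors are missed counterfeits or genuine items flagged by mistake. A confusion matrix separates the two. The sketch below uses made-up labels; `y_true` and `y_pred` are hypothetical stand-ins for `y_valid` and `my_pipeline.predict(X_valid)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = counterfeit, 0 = genuine
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Share of correct predictions, the same quantity that score() reports
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
```

In this toy case, one genuine item is flagged as counterfeit and one counterfeit slips through, which a single accuracy figure would not reveal.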