In [45]:
#Library import
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from pprint import pprint

Problem Formulation: The goal is to predict a product's rating from the other features known for that product on Wish.com. The input is the product's features (prices, buyers, units sold, etc.). The output is a rating score for the product, ranging from 1 to 5. This calls for a classification (prediction) data-mining function. One challenge is that the data may be noisy and incomplete, so data cleansing must be applied to fill in the blanks and prune unwanted features. Another challenge is selecting the model parameters, so the model has to be tested several times with different parameter settings. Additionally, unimportant features could be removed, but some method is needed to identify them. An ideal solution would be a model with zero training and test error, i.e. one that outputs exactly the correct rating for every product.

In [46]:
data = pd.read_csv('train_new.csv').sample(frac=1) #shuffle
#keep only rows whose rating is one of the integer labels 1-5
data = data.loc[data['rating'].isin([1, 2, 3, 4, 5])]
data = data.fillna(0)#fill missing values with 0
#drop features that shouldn't count towards the result
data = data.drop(['merchant_id', 'merchant_profile_picture', 'id', 'tags'], axis=1)

In [47]:
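#train-validation data split (random mask, roughly 70/30)
msk = np.random.rand(len(data)) < 0.7
#.copy() gives independent frames, so later column assignments do not act on a slice of data
tr = data[msk].copy()
val = data[~msk].copy()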
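
The random mask above gives a different split on every run and does not keep the rating proportions equal across the two parts. A minimal sketch of a reproducible, stratified alternative, assuming scikit-learn's `train_test_split`; the toy frame, the `units_sold` values, and the seed are illustrative and not taken from the dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 rows with an imbalanced 'rating' label.
toy = pd.DataFrame({
    "units_sold": range(100),
    "rating": [5] * 60 + [4] * 30 + [3] * 10,
})

# stratify keeps the rating proportions similar in both parts;
# random_state makes the shuffle repeatable (the value 0 is arbitrary).
toy_tr, toy_val = train_test_split(
    toy, test_size=0.3, stratify=toy["rating"], random_state=0
)

print(len(toy_tr), len(toy_val))
```

Fixing the seed matters when several attempts are compared, because otherwise score differences can come from the split rather than from the tuning.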

In [48]:
#categorical data encoding
dict_cat = {}


# columns that are of categorical value
cat_cols = tr.columns[tr.dtypes==object].to_list()



def cat_digit(col):  
    # build the mapping
    encoded = col.astype('category').cat.codes
    # store the mapping
    dict_cat[col.name] = dict(zip(np.asarray(col), np.asarray(encoded)))
    return encoded

# for each categorical feature, apply cat_digit where we build the mapping and transform the data
# this is for the training set (where we build the mapping)
tr[cat_cols] = tr[cat_cols].apply(lambda col: cat_digit(col))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [49]:
# encode the validation set
# reuse the mappings built from the training set to transform the validation set
val[cat_cols] = val[cat_cols].apply(lambda col: col.map(dict_cat[col.name]))
# string values not seen in the training set map to NaN; replace them with -1
val = val.fillna(-1)
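
The fit-on-train, apply-elsewhere pattern used above can be shown on a single toy column. This is a sketch only: the helper names `fit_category_codes` and `apply_category_codes` and the colour values are hypothetical, and the codes are assigned in sorted order rather than via `cat.codes`:

```python
import pandas as pd

def fit_category_codes(col: pd.Series) -> dict:
    """Map each distinct training value to an integer code (sorted order)."""
    categories = sorted(col.astype(str).unique())
    return {value: code for code, value in enumerate(categories)}

def apply_category_codes(col: pd.Series, mapping: dict) -> pd.Series:
    """Encode a column with a fitted mapping; unseen values become -1."""
    return col.astype(str).map(mapping).fillna(-1).astype(int)

train_col = pd.Series(["red", "blue", "red", "green"], name="product_color")
val_col = pd.Series(["blue", "purple"], name="product_color")

mapping = fit_category_codes(train_col)          # built from training data only
encoded_train = apply_category_codes(train_col, mapping)
encoded_val = apply_category_codes(val_col, mapping)

print(encoded_train.tolist())  # [2, 0, 2, 1]
print(encoded_val.tolist())    # [0, -1]  ('purple' was never seen in training)
```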

The provided dataset is already split into training and test files.

In [50]:
test_data = pd.read_csv('test_new.csv').sample(frac=1) 
_id = test_data['id']
test_data = test_data.fillna(0)
#preprocessing feature selection
test_data = test_data.drop(['merchant_id', 'merchant_profile_picture', 'id', 'tags'], axis=1)

test_data[cat_cols] = test_data[cat_cols].apply(lambda col: col.map(dict_cat[col.name]))
# again, not-seen string value filled with -1
test_data = test_data.fillna(-1)

What is the experimental protocol used and how was it carried out? What preprocessing steps are used? <br>
The purpose of the experiment is to predict a product's rating from the other features known for that product on Wish.com.<br>
The materials are the pre-split datasets train_new.csv and test_new.csv. The environment is a Google Colab notebook. <br>
Methodology:<br>
Decision Tree, SVM, and Naive Bayes models will be used to analyze and evaluate the data.<br>
First, preprocess the original data (done in the walkthrough notebook).<br>
Then, set up each model with its default settings, without any parameter tuning or extra data preprocessing beyond the walkthrough notebook. Train the model on the training data and compute a validation score on the validation data.<br>
After that, tune the hyperparameters or add extra data preprocessing, repeat the process, and check whether the validation score improves. At least 5 tuning attempts should be made.<br>
<br>
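The data preprocessing steps used here are missing-value handling, categorical data encoding, and the dataset split. For the Decision Tree and Naive Bayes models, no feature scaling is used. For the SVM, a StandardScaler is fitted, but the notebook never calls its transform, so the SVM is also trained on unscaled features.<br>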
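
The train-then-validate loop described in the protocol can be sketched as a small helper. Everything here is illustrative: `validation_scores` is a hypothetical name, and the data is synthetic (from `make_classification`) rather than the Wish.com dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def validation_scores(candidates, X_tr, y_tr, X_val, y_val):
    """Fit each candidate on the training part and return its micro-F1 on validation."""
    scores = {}
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        scores[name] = f1_score(y_val, model.predict(X_val), average="micro")
    return scores

# Synthetic 3-class problem standing in for the real features and ratings.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "default": DecisionTreeClassifier(random_state=0),
    "entropy_depth6": DecisionTreeClassifier(
        criterion="entropy", max_depth=6, random_state=0
    ),
}
scores = validation_scores(candidates, X_tr, y_tr, X_val, y_val)
print(scores)
```

Keeping all attempts in one dictionary makes it easy to compare them on exactly the same validation rows.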


# Decision tree

Attempt1: default configuration

In [51]:
#Decision Tree
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [52]:
#training data target values
tr_y = tr['rating']
#training data features
tr_x = tr.drop('rating', axis=1)
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(tr_x,tr_y)



In [53]:
#split validation data into features and target, then predict
val_y = val['rating']
val_x = val.drop('rating', axis=1)
pred_val = clf.predict(val_x)

In [54]:
#validation score (micro-averaged F1)
val_score = f1_score(val_y, pred_val, average='micro')
print(val_score)

0.5255255255255256
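
For single-label multiclass problems like this one, the micro-averaged F1 score equals plain accuracy, because every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class). A small sketch with a hand-written `micro_f1` (a hypothetical helper, not the scikit-learn implementation) and made-up labels; it assumes at least one prediction is correct, otherwise the ratios are undefined:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes before dividing."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = fp = fn = 0
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp += np.sum((y_pred == label) & (y_true == label))
        fp += np.sum((y_pred == label) & (y_true != label))
        fn += np.sum((y_pred != label) & (y_true == label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 2, 3, 3, 5, 4]
y_pred = [1, 2, 3, 4, 5, 5]

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
score = float(micro_f1(y_true, y_pred))
print(accuracy, score)  # both are 4/6
```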


Attempt2: Feature reduction for low-importance features, plus a different split-quality criterion (entropy) and a depth limit (max_depth=6)

In [55]:
#get the (unnormalized) feature importances from the previous attempt
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))

feat importance = [0.05396051 0.02187632 0.         0.00746259 0.00584514 0.0964197
 0.         0.00443714 0.03638079 0.         0.01997378 0.00131579
 0.00131579 0.00423379 0.00131579 0.         0.0306669  0.
 0.00307018 0.00811404 0.00470648 0.02472375 0.03312997 0.01482469
 0.02186171 0.05771063 0.00027296 0.         0.        ]


In [56]:
#construct dataframe for the feature importance
df=pd.DataFrame({'Feature_names':tr_x.columns,'Importances':feat_importance})
df.sort_values(by='Importances',ascending=False)

    Feature_names          Importances
5   rating_count           0.09642
25  merchant_rating        0.057711
0   price                  0.053961
8   badge_product_quality  0.036381
22  merchant_name          0.03313
16  countries_shipped_to   0.030667
21  merchant_title         0.024724
1   retail_price           0.021876
24  merchant_rating_count  0.021862
10  product_color          0.019974


In [57]:
#select the features with low importance (below 0.001)
drop_values=df.loc[df['Importances']<0.001]

In [58]:
#make it into a list
drop=drop_values['Feature_names']
d=drop.tolist()
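
The cells above threshold the unnormalized importances returned by `compute_feature_importances(normalize=False)`. A sketch of the same selection using the fitted tree's public `feature_importances_` attribute, which is normalized to sum to 1, so the cut-off then reads as a share of total importance. The toy columns `signal` and `constant` are made up for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# 'signal' separates the two classes perfectly; 'constant' carries no information.
toy_x = pd.DataFrame({
    "signal": [0, 1, 2, 3, 4, 5, 6, 7],
    "constant": [1, 1, 1, 1, 1, 1, 1, 1],
})
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = DecisionTreeClassifier(random_state=0).fit(toy_x, toy_y)

# Label each importance with its column name, then pick the low ones.
importances = pd.Series(toy_clf.feature_importances_, index=toy_x.columns)
low_importance = importances[importances < 0.001].index.tolist()

print(importances.sort_values(ascending=False))
print(low_importance)  # ['constant']
```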

In [59]:
#training data target values
tr_y_1 = tr['rating']
#training data features
tr_x_1 = tr.drop('rating', axis=1)
#drop the low-importance features
tr_x_1=tr_x_1.drop(d,axis=1)
# Create Decision Tree classifier object (entropy criterion, depth capped at 6)
clf_1 = DecisionTreeClassifier(criterion='entropy',max_depth=6)

# Train Decision Tree classifier
clf_1 = clf_1.fit(tr_x_1,tr_y_1)



In [60]:
#validation features: drop the target column
val_x_1 = val.drop('rating', axis=1)
#drop the low-importance features
val_x_1=val_x_1.drop(d,axis=1)
#prediction
pred_val_1 = clf_1.predict(val_x_1)

In [61]:
#generate validation score
val_score = f1_score(val_y, pred_val_1, average='micro')
print(val_score)

0.6726726726726727


In [62]:
# once you are happy with your local model, let's prepare a submission
# we need to apply the same preprocessing steps on the testing set as you did before you train the model
test_data_1 = test_data.drop(d, axis=1)

Attempt3: Pre-pruning by searching for the most suitable maximum depth and maximum number of features
(reuses the reduced feature set from Attempt2)

In [63]:
# use the gridsearchCV module to find the best parameter set
from sklearn.model_selection import GridSearchCV
#candidate parameter values
parameters={'max_depth':[2,4,6,8,10,12,14,16,18,20],'max_features':[2,4,6,8,10,12,14,16,18,20,22,24,25,26]}
#construct decision tree
default_clf=DecisionTreeClassifier()
#tune the decision tree using grid search
grid_search=GridSearchCV(estimator=default_clf,param_grid=parameters)
grid_search.fit(tr_x_1,tr_y_1)
#DecisionTreeClassifier().get_params().keys()

200 fits failed out of a total of 700.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
200 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 308, in fit
    raise ValueError("max_features must be in (0, n_features]")
ValueError: max_features must be in (0, n_features]

 0.75921053 0.76315789 0.76578947 0.77236842        nan        nan
        nan        n

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
                         'max_features': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20,
                                          22, 24, 25, 26]})
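
The 200 failed fits come from `max_features` candidates that exceed the number of columns left after the feature reduction. One way to avoid them is to filter the candidates against the actual column count before searching. A sketch on synthetic data; the candidate values, `cv=3`, and `scoring="f1_micro"` are assumptions made for this example, not the settings used above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 6 features, standing in for the reduced training set.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

n_features = X.shape[1]
candidate_max_features = [2, 4, 6, 8, 10]
param_grid = {
    "max_depth": [2, 4],
    # keep only values a tree can actually use: max_features <= n_features
    "max_features": [m for m in candidate_max_features if m <= n_features],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1_micro",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
```

With the grid capped this way, every combination produces a score, so no entries in `cv_results_` are set to nan.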

In [64]:
#take the best parameter set found by the grid search
model=grid_search.best_estimator_
#training
model.fit(tr_x_1,tr_y_1)
#in this run the best setting is max_depth=4, max_features=20 (see the output below)

DecisionTreeClassifier(max_depth=4, max_features=20)

In [65]:
#prediction
pred_val_2 = model.predict(val_x_1)

In [66]:
#find f1_score
val_score = f1_score(val_y, pred_val_2, average='micro')
print(val_score)

0.7477477477477479


In [67]:

pred_test = model.predict(test_data_1)#predict from test data
pred_df = pd.DataFrame(data={'id': np.asarray(_id), 'rating': pred_test})
#generate csv file
pred_df.to_csv('pred_walkthrough.csv', index=False)

# SVM

In [68]:
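#feature scaling for svm
from sklearn.preprocessing import StandardScaler
# Initialise the Scaler
scaler = StandardScaler()
 
tr_svm=tr.copy()
val_svm=val.copy()
#NOTE: fit() only learns per-column mean/std; it does not modify tr_svm or val_svm.
#The second fit() overwrites the statistics from the first, and transform() is never
#called, so the frames below still hold the unscaled values.
scaler.fit(tr_svm)
scaler.fit(val_svm)
#x-y split
tr_y_svm = tr_svm['rating']
tr_x_svm = tr_svm.drop('rating', axis=1)
val_y_svm = val_svm['rating']
val_x_svm = val_svm.drop('rating', axis=1)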
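
The cell above only calls `scaler.fit`, which learns statistics but does not change the data. A minimal sketch of the usual pattern, on a made-up one-column frame: split off the target first, fit the scaler on the training features only, then transform both parts with those training statistics. The SVM scores reported in this notebook were not produced this way.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for tr and val.
toy_tr = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0], "rating": [5, 4, 5, 3]})
toy_val = pd.DataFrame({"price": [2.5, 10.0], "rating": [4, 5]})

# Separate features and target before scaling so the label is never scaled.
toy_tr_x, toy_tr_y = toy_tr.drop("rating", axis=1), toy_tr["rating"]
toy_val_x, toy_val_y = toy_val.drop("rating", axis=1), toy_val["rating"]

scaler = StandardScaler()
toy_tr_x_scaled = scaler.fit_transform(toy_tr_x)  # learn mean/std on training features
toy_val_x_scaled = scaler.transform(toy_val_x)    # reuse the training statistics

print(toy_tr_x_scaled.ravel())
print(toy_val_x_scaled.ravel())
```

Fitting on the validation data as well would leak information about it into the preprocessing, which is why only `transform` is applied there.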

Attempt1: default configuration

In [69]:
#Import svm model
from sklearn import svm

#Default svm rbf classifier
SVM_clf = svm.SVC() 

#Model training
SVM_clf.fit(tr_x_svm, tr_y_svm)

SVC()

In [70]:
#prediction
pred_val_svm_1 = SVM_clf.predict(val_x_svm)

In [71]:
val_score = f1_score(val_y_svm, pred_val_svm_1, average='micro')
print(val_score)

0.7027027027027027


Attempt2: 
For the SVM model in this case, I judged a linear kernel to be a poor choice because the training data has too many features. For the same reason I did not apply feature selection, since I decided to use a non-linear SVM, and I also ruled out the polynomial kernel.

Therefore, in this attempt, I will try the sigmoid kernel with gamma=0.002 and see whether tuning the kernel and gamma has any effect on the result.

In [72]:
#SVM classifier with sigmoid kernel and gamma=0.002
SVM_clf_2 = svm.SVC(kernel='sigmoid',gamma=0.002) 


#Model training
SVM_clf_2.fit(tr_x_svm, tr_y_svm)


SVC(gamma=0.002, kernel='sigmoid')

In [73]:
#prediction
pred_val_svm_2 = SVM_clf_2.predict(val_x_svm)

In [74]:
#find f1_score
val_score = f1_score(val_y_svm, pred_val_svm_2, average='micro')
print(val_score)
#no significant change from before

0.7027027027027027


Attempt3: This time, try the RBF kernel with balanced class weights and a larger gamma. This could improve the F1 score, because class_weight='balanced' weights each class inversely to its frequency, so the rarer ratings are not outweighed by the most common one.

In [75]:
#RBF-kernel SVM with balanced class weights and gamma=0.1
SVM_clf_3 = svm.SVC(gamma=0.1,class_weight='balanced') 


#Model training
SVM_clf_3.fit(tr_x_svm, tr_y_svm)


SVC(class_weight='balanced', gamma=0.1)

In [76]:
#predictions
pred_val_svm_3 = SVM_clf_3.predict(val_x_svm)

In [77]:
val_score = f1_score(val_y_svm, pred_val_svm_3, average='micro')
print(val_score)
#shows improvement in validation score

0.7447447447447447
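
scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. A small sketch of that arithmetic on made-up ratings; `balanced_class_weights` is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

toy_ratings = [5, 5, 5, 5, 4, 4, 3, 1]
weights = balanced_class_weights(toy_ratings)
print(weights)  # {1: 2.0, 3: 2.0, 4: 1.0, 5: 0.5}
```

The rarest ratings get the largest weights, so misclassifying them costs more during training.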


In [78]:
#RBF-kernel SVM with balanced class weights and a larger gamma
SVM_clf_3_1 = svm.SVC(gamma=0.31,class_weight='balanced') 
#after trying different gamma settings,
#the f1_score starts to drop once gamma > 0.3


#Model training
SVM_clf_3_1.fit(tr_x_svm, tr_y_svm)
#Predictions
pred_val_svm_3_1 = SVM_clf_3_1.predict(val_x_svm)
val_score = f1_score(val_y_svm, pred_val_svm_3_1, average='micro')
print(val_score)

0.7417417417417418


# Naive Bayes
Attempt1: default configuration

In [79]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Model training using original data
gnb.fit(tr_x, tr_y)

#Prediction
pred_val_naivebayes = gnb.predict(val_x)
val_score = f1_score(val_y, pred_val_naivebayes, average='micro')
print(val_score)

0.5165165165165165


Attempt2: Feature selection

For the Naive Bayes model, hyperparameter tuning is unlikely to improve the metric much. Feature selection is more promising: Naive Bayes assumes the features are conditionally independent given the class, so strongly correlated features are removed in this attempt.
I would expect the scores to improve.

In [80]:
correlated_features =set()#empty set to put the features in 
correlation_matrix = tr_x.corr()#generate the correlation matrix

In [81]:
column_range=len(correlation_matrix.columns)
#get the correlated features
for x in range(column_range):#row index
    for y in range(x):#column index (entries below the diagonal only)
        corr_=abs(correlation_matrix.iloc[x, y])#absolute correlation score
        if corr_ > 0.8:#the pair counts as correlated if |correlation| > 0.8
            feature = correlation_matrix.columns[x]#record the later feature of the pair
            correlated_features.add(feature)

In [82]:
#correlated features as shown
print(correlated_features)

{'badge_product_quality', 'urgency_text', 'rating_count', 'shipping_option_price'}
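
The nested loop above checks every pair below the diagonal of the correlation matrix and records the later column of each correlated pair. The same rule can be written without explicit loops over indices by masking the matrix. This is a sketch on made-up columns; `correlated_columns` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def correlated_columns(frame: pd.DataFrame, threshold: float = 0.8) -> set:
    """Return columns whose |correlation| with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only entries strictly below the diagonal so each pair is checked once.
    below_diagonal = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    lower = corr.where(below_diagonal)
    return {col for col in lower.index if (lower.loc[col] > threshold).any()}

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # almost a multiple of 'a'
    "c": [5, 1, 4, 2, 3],      # only weakly related to 'a' and 'b'
})
to_drop = correlated_columns(toy)
print(to_drop)  # {'b'}
```

Only one column of each correlated pair is dropped, so the information shared by the pair is still available to the model through the column that stays.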


In [83]:
#drop the correlated features
tr_x_bayes=tr_x.drop(correlated_features,axis=1)

In [84]:
#Create Naive bayes model
gnb = GaussianNB()

#Model Training
gnb.fit(tr_x_bayes, tr_y)

#Prediction
val_x_bayes=val_x.drop(correlated_features,axis=1)
pred_val_naivebayes_1 = gnb.predict(val_x_bayes)
val_score = f1_score(val_y, pred_val_naivebayes_1, average='micro')
print(val_score)

#the validation score improves over Attempt1

0.5465465465465466


The final prediction submission used the result from Attempt3 in the Decision Tree model section.

Questions：<br>
🌈 Q: Why is Data Mining a misnomer? What is another preferred name?<br>
A: Because the goal of 'data mining' is to analyze patterns in existing data and extract knowledge from it, rather than to extract the data itself. A preferred name is Knowledge Discovery in Data (KDD).<br>
🌈 What is the general knowledge discovery process? What is the difference between a data engineer and a data scientist/AI engineer?<br>
A: The general knowledge discovery process is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. In the data science hierarchy of needs, data engineers handle the movement and storage of data, while data scientists explore, aggregate/label, and optimize the data. More specifically, data engineers collect the data and transform it into "pipelines" for the data scientists to analyze, test, aggregate, and optimize.<br>
🌈 In data mining, what is the difference between prediction and categorization?<br>
A: Prediction uses existing data to estimate a missing or unavailable value. Categorization does not generate new numerical values; it uses the similarities between existing data points to determine which class they belong to (i.e. to identify their labels).<br>
🌈 Why is data science/machine learning a bad idea in the context of information security?<br>
A: Data science/machine learning relies on large amounts of existing data for model training. This leaves room for information leaks and for recovering the original private data through reverse engineering. <br>
🌈 What is the CIA principle and how can we use it to assess the security/privacy aspect of AI systems/pipelines?<br>
A: The CIA principle stands for C (confidentiality), I (integrity), and A (availability). Confidentiality is roughly equivalent to privacy. Integrity means maintaining the consistency, accuracy, and trustworthiness of data over its entire lifecycle. Availability means that information should be consistently and readily accessible to authorized parties. The privacy/security aspect of an AI system can be examined using this principle with the following questions:<br>
&emsp;Confidentiality: Under what circumstances could personal or confidential information be disclosed by the system?<br>
&emsp;Integrity: Under what circumstances could the data be corrupted or the AI system be executed incorrectly?<br>
&emsp;Availability: Under what circumstances could the AI system unintentionally fail to provide its service?<br>