# Homework 3
In this assignment, we will build a Naïve Bayes classifier and an SVM model that predict productivity satisfaction on [the given dataset](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees), which records the productivity of garment employees.

## Background
The Garment Industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories. 

## Dataset Attribute Information

1. **date**: Date in MM-DD-YYYY
2. **day**: Day of the Week
3. **quarter** : A portion of the month. A month was divided into four quarters
4. **department** : Associated department with the instance
5. **team_no** : Associated team number with the instance
6. **no_of_workers** : Number of workers in each team
7. **no_of_style_change** : Number of changes in the style of a particular product
8. **targeted_productivity** : Targeted productivity set by the Authority for each team for each day.
9. **smv** : Standard Minute Value, it is the allocated time for a task
10. **wip** : Work in progress. Includes the number of unfinished items for products
11. **over_time** : Represents the amount of overtime by each team in minutes
12. **incentive** : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
13. **idle_time** : The amount of time when the production was interrupted due to several reasons
14. **idle_men** : The number of workers who were idle due to production interruption
15. **actual_productivity** : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

### Libraries that can be used: numpy, scipy, pandas, scikit-learn, cvxpy, imbalanced-learn
Any libraries used in the discussion materials are also allowed.

#### Other Notes

 - Don't worry if you cannot achieve high accuracy; it is neither the goal nor the grading standard of this assignment. <br >
 - If not specified, you are not required to do hyperparameter tuning, but feel free to do so if you'd like.

#### Troubleshooting
In case you have trouble installing or using imbalanced-learn (imblearn): <br >
Run the code cell below, then go to the menu bar at the top: Kernel > Restart. <br >
Then try `import imblearn` to see if things work.

In [1]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install imbalanced-learn delayed



# Exercises

## Exercise 1 - General Data Preprocessing (20 points)

Our dataset needs cleaning before we build any models. Some cleaning tasks are common to all models, but depending on the kind of model we are building, additional processing is sometimes needed. These additional tasks are described in each of the remaining two exercises.

Note that **we will be using this processed data from exercise 1 in each of the remaining two exercises**.

For convenience, here are the attributes that we will treat as **categorical attributes**: `day`, `quarter`, `department`, and `team`.

 - Drop the column `date`.
 - For each of the categorical attributes, **print out** all the unique elements.
 - For each of the categorical attributes, remap duplicated items that differ only because of typos or stray spaces.
     - For example, "a" and "a " should be the same, so we need to update "a " to be "a".
     - Another example, "apple" and "appel" should be the same, so you should update "appel" to be "apple".
     

 - Create another column named `satisfied` that records the productivity performance, defined as follows. **This is the dependent variable we'd like to classify in this assignment.**
     - Return True or 1 if `actual_productivity` is equal to or greater than `targeted_productivity`. Otherwise, return False or 0, which means the team fails to meet the expected performance.
 - Drop the columns `actual_productivity` and `targeted_productivity`.


 - Find and **print out** which columns/attributes have empty values, e.g., NA, NaN, null, None.
 - Fill the empty values with 0.
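
The steps above can be sketched on a tiny, made-up frame. This is only an illustration under stated assumptions: the values, the `toy` frame, and the `typo_map` dictionary are hypothetical and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "department": ["sewing", "sewing ", "finishing", "sweing"],
    "targeted_productivity": [0.80, 0.75, 0.70, 0.80],
    "actual_productivity": [0.85, 0.75, 0.60, 0.90],
    "wip": [1108.0, np.nan, np.nan, 968.0],
})

# 1. Strip stray whitespace, then remap known typos (hypothetical mapping).
typo_map = {"sweing": "sewing"}
toy["department"] = toy["department"].str.strip().replace(typo_map)
print(toy["department"].unique())

# 2. Label: 1 when the actual productivity meets or exceeds the target.
toy["satisfied"] = (toy["actual_productivity"] >= toy["targeted_productivity"]).astype(int)
toy = toy.drop(columns=["actual_productivity", "targeted_productivity"])

# 3. Report the columns that contain missing values, then fill them with 0.
cols_with_na = toy.columns[toy.isna().any()].tolist()
print(cols_with_na)
toy = toy.fillna(0)
```

Working on real numeric dtypes (rather than strings) lets `isna()` find the missing entries directly.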


In [1]:
import pandas as pd
import numpy as np
print('By Ofek Marom UCD')

# Step 1: read the CSV file; skipinitialspace=True drops spaces that follow a delimiter
df = pd.read_csv('garments_worker_productivity.csv',skipinitialspace = True)

categorical_attributes = ['quarter','department','day','team']

# Step 2: convert every column to lowercase strings (NaN becomes the string 'nan')
df = df.astype(str).apply(lambda x: x.str.lower())
#display(df.head())
# Step 3: remove leading and trailing whitespace
print('For each of the categorical attributes, remove extra whitespace')
for col in categorical_attributes:
    df[col] = df[col].str.strip()

# Step 4: Drop the column date
print('Drop the column date')
del df['date']
#display(df.head())
# Step 5: drop duplicates
print('drop duplicates')
df = df.drop_duplicates() # drop all duplicate records
#display(df.head())
# Step 6: print the unique values of each categorical attribute
print('For each of the categorical attributes, **print out** all the unique elements')
for col in categorical_attributes:
    print(df[col].unique())
    
    
# Step 7: create the `satisfied` column that records the productivity performance
print('Create another column named `satisfied` that records the productivity performance')
# Every column was cast to str in Step 2, so convert back to float before comparing;
# comparing the raw strings would be lexicographic rather than numeric.
targeted = df['targeted_productivity'].astype(float)
actual = df['actual_productivity'].astype(float)
# 1 when the actual productivity meets or exceeds the target, otherwise 0
df['satisfied'] = np.where(actual >= targeted, 1, 0)

#Drop the columns actual_productivity and targeted_productivity.
del df['actual_productivity']
del df['targeted_productivity']

print('Print all null in columns/attributes')
# After Step 2 every value is a string, so missing entries show up as one of these tokens
No_Data_symbol_list = ['na','nan','null','none']
for col in df:
    print('search nulls in ',col)
    # Iterate over the index labels so the lookup also works if rows were dropped earlier
    for index in df.index:
        if df.at[index, col] in No_Data_symbol_list:
            print('   * null was found in ',col,' row = ',index,' and will modify with 0')

# Step 8: fill and replace all NA, NaN, null, None with 0
print("fill and replace all NA, NaN, null, None with 0")
df = df.replace('na', 0)
df = df.replace('nan', 0)
df = df.replace('null', 0)
df = df.replace('none', 0)
df = df.fillna(0)    

display(df.head(5))

By Ofek Marom UCD
For each of the categorical attributes, remove extra whitespace
Drop the column date
drop duplicates
For each of the categorical attributes, **print out** all the unique elements
['quarter1' 'quarter2' 'quarter3' 'quarter4' 'quarter5']
['sweing' 'finishing']
['thursday' 'saturday' 'sunday' 'monday' 'tuesday' 'wednesday']
['8' '1' '11' '12' '6' '7' '2' '3' '9' '10' '5' '4']
Create another column named `satisfied` that records the productivity performance
Print all null in columns/attributes
search nulls in  quarter
search nulls in  department
search nulls in  day
search nulls in  team
search nulls in  smv
search nulls in  wip
   * null was found in  wip  row =  1  and will modify with 0
   * null was found in  wip  row =  6  and will modify with 0
   * null was found in  wip  row =  13  and will modify with 0
   * null was found in  wip  row =  14  and will modify with 0
   * null was found in  wip  row =  15  and will modify with 0
   * null was found in  wip  row =  

   * null was found in  wip  row =  261  and will modify with 0
   * null was found in  wip  row =  262  and will modify with 0
   * null was found in  wip  row =  263  and will modify with 0
   * null was found in  wip  row =  271  and will modify with 0
   * null was found in  wip  row =  274  and will modify with 0
   * null was found in  wip  row =  278  and will modify with 0
   * null was found in  wip  row =  279  and will modify with 0
   * null was found in  wip  row =  280  and will modify with 0
   * null was found in  wip  row =  281  and will modify with 0
   * null was found in  wip  row =  282  and will modify with 0
   * null was found in  wip  row =  291  and will modify with 0
   * null was found in  wip  row =  292  and will modify with 0
   * null was found in  wip  row =  296  and will modify with 0
   * null was found in  wip  row =  298  and will modify with 0
   * null was found in  wip  row =  299  and will modify with 0
   * null was found in  wip  row =  300 

   * null was found in  wip  row =  914  and will modify with 0
   * null was found in  wip  row =  918  and will modify with 0
   * null was found in  wip  row =  920  and will modify with 0
   * null was found in  wip  row =  923  and will modify with 0
   * null was found in  wip  row =  924  and will modify with 0
   * null was found in  wip  row =  925  and will modify with 0
   * null was found in  wip  row =  926  and will modify with 0
   * null was found in  wip  row =  936  and will modify with 0
   * null was found in  wip  row =  937  and will modify with 0
   * null was found in  wip  row =  940  and will modify with 0
   * null was found in  wip  row =  941  and will modify with 0
   * null was found in  wip  row =  942  and will modify with 0
   * null was found in  wip  row =  943  and will modify with 0
   * null was found in  wip  row =  944  and will modify with 0
   * null was found in  wip  row =  948  and will modify with 0
   * null was found in  wip  row =  955 

search nulls in  no_of_style_change
search nulls in  no_of_workers
search nulls in  satisfied
fill and replace all NA, NaN, null, None with 0


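| | quarter | department | day | team | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | quarter1 | sweing | thursday | 8 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 1 |
| 1 | quarter1 | finishing | thursday | 1 | 3.94 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 1 |
| 2 | quarter1 | sweing | thursday | 11 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 3 | quarter1 | sweing | thursday | 12 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 4 | quarter1 | sweing | thursday | 6 | 25.9 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 1 |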


## Exercise 2 - Naïve Bayes Classifier (40 points in total)

### Exercise 2.1 - Additional Data Preprocessing (10 points)

To build a Naïve Bayes Classifier, we need to further encode our categorical variables.

 - For each of the **categorical attributes**, encode the set of categories to be **0 ~ (n_classes - 1)**.
     - For example, \["paris", "paris", "tokyo", "amsterdam"\] should be encoded as \[1, 1, 2, 0\].
     - Note that the order does not really matter, i.e., \[0, 0, 1, 2\] also works. But you have to start with 0 in your encodings.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.
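
As a minimal sketch of the two steps above, the city example can be reproduced with scikit-learn's `OrdinalEncoder`, followed by an 80:20 split. The `cities` and `toy` frames and the `random_state` value are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

cities = pd.DataFrame({"city": ["paris", "paris", "tokyo", "amsterdam"]})

# OrdinalEncoder sorts the categories and assigns the codes 0 .. n_categories - 1.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(cities[["city"]]).astype(int).ravel()
print(codes.tolist())  # [1, 1, 2, 0]

# 80:20 split; random_state is fixed only to make the sketch repeatable.
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
train, test = train_test_split(toy, test_size=0.2, random_state=0)
print(len(train), len(test))  # 8 2
```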

In [10]:
# For each of the categorical attributes, encode the set of categories to be 0 ~ (n_classes - 1).
#### Ordinal encoding ====
from sklearn.model_selection import train_test_split

# Manual mapping; 'team' gets special treatment below (team = team - 1)
remap = {
    'quarter': {'quarter1': 0,'quarter2': 1, 'quarter3': 2, 'quarter4': 3,'quarter5':4},
    'department': {'sweing': 0,'finishing': 1},
    'day': {'monday': 0, 'tuesday': 1, 'wednesday': 2, 'thursday': 3,'friday':4,'saturday':5,'sunday':6},
}

df1 = df.replace(remap) # encode the quarter, department and day attributes
print('remap attributes: quarter, department and day')
# -- special treatment for team: subtract 1 from every team number so the range 1-12 maps to 0-11
df1['team'] = df['team'].astype(int) - 1
print('remap attribute team: team = team - 1, so team runs from 0 to 11')

print('Data prior encoding')
display(df.head())
print('Data post encoding')
display(df1.head())

remap attributes: quarter, department and day
remap attribute team: team = team - 1, so team runs from 0 to 11
Data prior encoding


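| | quarter | department | day | team | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | quarter1 | sweing | thursday | 8 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 1 |
| 1 | quarter1 | finishing | thursday | 1 | 3.94 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 1 |
| 2 | quarter1 | sweing | thursday | 11 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 3 | quarter1 | sweing | thursday | 12 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 4 | quarter1 | sweing | thursday | 6 | 25.9 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 1 |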


Data post encoding


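| | quarter | department | day | team | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 7 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 1 |
| 1 | 0 | 1 | 3 | 0 | 3.94 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 1 |
| 2 | 0 | 0 | 3 | 10 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 3 | 0 | 0 | 3 | 11 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 4 | 0 | 0 | 3 | 5 | 25.9 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 1 |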


In [11]:
#Split the data into training and testing set with the ratio of 80:20.
train, test = train_test_split(df1, test_size=0.2, random_state=21)

#--- split data 80/20 ---
print('--- split data 80/20 ---')
X_train_org, y_train = train.copy().drop(columns=['satisfied']), train['satisfied']
X_test_org, y_test = test.copy().drop(columns=['satisfied']), test['satisfied']
y_train.value_counts()



--- split data 80/20 ---


1    703
0    254
Name: satisfied, dtype: int64

### Exercise 2.2 - Naïve Bayes Classifier for Categorical Attributes (15 points)

Using the categorical attributes **only**, build a Categorical Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.
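
A minimal sketch of this task on synthetic, integer-coded features is shown below. The toy arrays and the choice of `min_categories` are assumptions made for illustration; the real features come from Exercise 2.1.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.naive_bayes import CategoricalNB

# Made-up integer-coded features: two categorical columns with 3 and 2 levels.
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.integers(0, 3, size=100), rng.integers(0, 2, size=100)])
y_toy = (X_toy[:, 0] == 2).astype(int)  # the label depends on the first column only

X_tr, X_te, y_tr, y_te = X_toy[:80], X_toy[80:], y_toy[:80], y_toy[80:]

# min_categories tells the model how many levels each column can take, so a
# level that is absent from the training rows does not break predict().
model = CategoricalNB(min_categories=[3, 2])
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(classification_report(y_te, pred, zero_division=0))
```

Setting `min_categories` is optional; it mainly matters when a category appears only in the test split.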

In [29]:
# Remember to do this task with your processed data from Exercise 2.1
from sklearn.metrics import classification_report
# Step 1: keep the categorical attributes only
X_train = X_train_org[categorical_attributes].copy()
X_test  = X_test_org[categorical_attributes].copy()
print("--- Train ----")
display(X_train.head())
display(y_train.head())
print("--- Test ----")
display(X_test.head())
display(y_test.head())




--- Train ----


| | quarter | department | day | team |
|---|---|---|---|---|
| 961 | 3 | 0 | 3 | 0 |
| 467 | 3 | 0 | 1 | 0 |
| 1048 | 0 | 1 | 2 | 7 |
| 1179 | 1 | 0 | 2 | 2 |
| 710 | 1 | 0 | 1 | 7 |


961     1
467     1
1048    1
1179    1
710     0
Name: satisfied, dtype: int32

--- Test ----


| | quarter | department | day | team |
|---|---|---|---|---|
| 230 | 1 | 0 | 1 | 2 |
| 550 | 0 | 0 | 6 | 9 |
| 824 | 2 | 1 | 2 | 5 |
| 394 | 3 | 1 | 5 | 9 |
| 213 | 1 | 1 | 0 | 10 |


230    0
550    1
824    1
394    1
213    0
Name: satisfied, dtype: int32

In [30]:
# Use CategoricalNB() to fit a Categorical Naive Bayes model on the integer-encoded categorical data
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import classification_report

CB = CategoricalNB()
CB.fit(X_train, np.asarray(y_train))

print("-- classification Train Report --")
print(classification_report(y_train, CB.predict(X_train)))
print("-- classification Test Report --")
print(classification_report(y_test, CB.predict(X_test)))



-- classification Train Report --
              precision    recall  f1-score   support

           0       0.55      0.20      0.29       254
           1       0.76      0.94      0.84       703

    accuracy                           0.74       957
   macro avg       0.66      0.57      0.57       957
weighted avg       0.71      0.74      0.70       957

-- classification Test Report --
              precision    recall  f1-score   support

           0       0.66      0.28      0.39        68
           1       0.77      0.94      0.85       172

    accuracy                           0.75       240
   macro avg       0.71      0.61      0.62       240
weighted avg       0.74      0.75      0.72       240



### Exercise 2.3 - Naïve Bayes Classifier for Numerical Attributes (15 points)

Using the numerical attributes **only**, build a Gaussian Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

**Remember to scale your data. The scaling method is up to you.**
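
One way to keep scaling and fitting together is a scikit-learn pipeline. The sketch below uses synthetic numerical data and `StandardScaler` as an assumed scaling choice; any other scaler could be substituted.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up numerical features on very different scales.
X_toy = np.column_stack([rng.normal(0, 1, 200), rng.normal(5000, 1500, 200)])
y_toy = (X_toy[:, 1] > 5000).astype(int)
X_tr, X_te, y_tr, y_te = X_toy[:160], X_toy[160:], y_toy[:160], y_toy[160:]

# The pipeline fits the scaler on the training rows only and reuses those
# statistics when predicting on the test rows.
gnb = make_pipeline(StandardScaler(), GaussianNB())
gnb.fit(X_tr, y_tr)
pred = gnb.predict(X_te)
print(classification_report(y_te, pred, zero_division=0))
```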

In [21]:
# Remember to do this task with your processed data from Exercise 2.1
# Use GaussianNB() for performing Gaussian Naive-Bayes on numerical data
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
NB = GaussianNB()
# Step 1: keep the numerical attributes only by dropping the categorical ones
X_train = X_train_org.drop(columns=categorical_attributes)
X_test  = X_test_org.drop(columns=categorical_attributes)
print("--- Train ----")
display(X_train.head())
display(y_train.head())
print("--- Test ----")
display(X_test.head())
display(y_test.head())

# Fit the scaler on the training split only, then apply it to both splits
scaler.fit(X_train)
X_train_numerical_attributes_scaler = scaler.transform(X_train) # saved for later use
X_test_numerical_attributes_scaler = scaler.transform(X_test)   # saved for later use
NB.fit(X_train_numerical_attributes_scaler, np.asarray(y_train))
print("-- classification Train Report --")
print(classification_report(y_train, NB.predict(X_train_numerical_attributes_scaler)))
print("-- classification Test Report --")
print(classification_report(y_test, NB.predict(X_test_numerical_attributes_scaler)))

--- Train ----


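| | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers |
|---|---|---|---|---|---|---|---|---|
| 961 | 26.66 | 1164.0 | 6600 | 23 | 0.0 | 0 | 2 | 55.0 |
| 467 | 22.94 | 1871.0 | 6960 | 55 | 0.0 | 0 | 0 | 58.0 |
| 1048 | 4.6 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 |
| 1179 | 30.1 | 735.0 | 6960 | 63 | 0.0 | 0 | 1 | 58.0 |
| 710 | 30.1 | 1061.0 | 6960 | 30 | 0.0 | 0 | 1 | 58.0 |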


961     1
467     1
1048    1
1179    1
710     0
Name: satisfied, dtype: int32

--- Test ----


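| | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers |
|---|---|---|---|---|---|---|---|---|
| 230 | 42.27 | 465.0 | 9900 | 54 | 0.0 | 0 | 0 | 55.0 |
| 550 | 22.52 | 1282.0 | 6720 | 75 | 0.0 | 0 | 0 | 56.0 |
| 824 | 2.9 | 0.0 | 1440 | 0 | 0.0 | 0 | 0 | 12.0 |
| 394 | 3.94 | 0.0 | 1800 | 0 | 0.0 | 0 | 0 | 10.0 |
| 213 | 4.15 | 0.0 | 1440 | 0 | 0.0 | 0 | 0 | 8.0 |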


230    0
550    1
824    1
394    1
213    0
Name: satisfied, dtype: int32

-- classification Train Report --
              precision    recall  f1-score   support

           0       0.73      0.04      0.08       254
           1       0.74      0.99      0.85       703

    accuracy                           0.74       957
   macro avg       0.74      0.52      0.47       957
weighted avg       0.74      0.74      0.65       957

-- classification Test Report --
              precision    recall  f1-score   support

           0       1.00      0.04      0.08        68
           1       0.73      1.00      0.84       172

    accuracy                           0.73       240
   macro avg       0.86      0.52      0.46       240
weighted avg       0.80      0.73      0.63       240



## Exercise 3 - SVM Classifier (40 points in total)

### Exercise 3.1 - Additional Data Preprocessing (10 points)

To build an SVM Classifier, we need a different encoding for our categorical variables.

 - For each of the **categorical attributes**, encode them with **one-hot encoding**.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.
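
A minimal sketch of one-hot encoding combined with scaling is given below, under the assumption that both transformers are fitted on the training split and then only applied to the test split. The toy frames and the use of `ColumnTransformer` are illustrative choices, not a requirement of the assignment.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up train/test rows; "sunday" appears only in the test split.
train = pd.DataFrame({"day": ["monday", "tuesday", "monday"], "smv": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"day": ["tuesday", "sunday"], "smv": [20.0, 40.0]})

# handle_unknown="ignore" encodes a category unseen during fit as all zeros
# instead of raising an error.
prep = ColumnTransformer(
    [
        ("num", StandardScaler(), ["smv"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["day"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_enc = prep.fit_transform(train)  # learn means, variances, and categories
X_test_enc = prep.transform(test)        # reuse them; no refitting on the test rows
print(X_train_enc.shape, X_test_enc.shape)
```

Because the encoder is fitted once, the train and test matrices are guaranteed to share the same column layout even when a category is missing from one split.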


In [22]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Step 1: start from copies of the unprocessed train/test features (all attributes are used here)
X_train = X_train_org.copy() #clone X_train_org
X_test  = X_test_org.copy()  #clone X_test_org

# Define which columns should be encoded vs scaled
columns_to_scale = ['smv','wip','over_time','incentive','idle_time','idle_men','no_of_style_change','no_of_workers']
columns_to_encode  = ['quarter','department','day','team']

# Instantiate encoder/scaler
scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)

# ---- prepare X_train ----
# Scale and Encode Separate Columns
scaled_columns  = scaler.fit_transform(X_train[columns_to_scale]) 
encoded_columns =    ohe.fit_transform(X_train[columns_to_encode])

# Concatenate (Column-Bind) Processed Columns Back Together
X_train_processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)

display(X_train_processed_data[1:2])

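# ---- prepare X_test ----
# Scale and Encode Separate Columns
# Note: fit_transform re-fits the scaler and encoder on the test split; calling
# transform() here instead would reuse the statistics and categories learned from X_train.
scaled_columns  = scaler.fit_transform(X_test[columns_to_scale]) 
encoded_columns =    ohe.fit_transform(X_test[columns_to_encode])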

# Concatenate (Column-Bind) Processed Columns Back Together
X_test_processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)

display(X_test_processed_data[1:2])

array([[ 0.71156403,  0.70466889,  0.71038598,  0.0850198 , -0.06338855,
        -0.1147066 , -0.35080434,  1.0421417 ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ]])

array([[ 0.71662743,  0.87812142,  0.66567643,  0.59743145, -0.11125405,
        -0.10599979, -0.35486811,  1.01284675,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ]])

### Exercise 3.2 - SVM with Different Kernels (20 points)

Using all the attributes we have, build an SVM that predicts the column `satisfied`. <br >
Specifically, please 
 - Build one SVM with a **linear kernel**.
 - Build another SVM with an **rbf kernel**.
 - Report the **testing results** of **both models** using `classification_report`.

The kernel is the only required setting. <br >
Other hyperparameter tuning is not required. If you do tune the models, make sure the other hyperparameters are identical in the two SVMs. In other words, the only difference between the two SVMs should be the kernel setting.

**Remember to scale your data. The scaling method is up to you.**
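
The requirement that only the kernel differs can be sketched by building both models from one shared settings dictionary. The synthetic data, the `shared` dictionary, and its `C` value are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real features come from Exercise 3.1.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

shared = {"C": 1.0}  # hypothetical shared setting; only the kernel differs
models = {}
for kernel in ("linear", "rbf"):
    models[kernel] = make_pipeline(StandardScaler(), SVC(kernel=kernel, **shared))
    models[kernel].fit(X_tr, y_tr)
    print(f"-- kernel={kernel!r} --")
    print(classification_report(y_te, models[kernel].predict(X_te), zero_division=0))
```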

In [23]:
# Remember to do this task with your processed data from Exercise 3.1
# Import the SVC class directly so the variable `svm` below does not shadow the sklearn.svm module
from sklearn.svm import SVC

# Create an SVM classifier with a linear kernel
clf = SVC(kernel='linear')
# Train the model using the training set
clf.fit(X_train_processed_data, y_train)
print("-- kernel='linear': classification Train Report --")
print(classification_report(y_train, clf.predict(X_train_processed_data)))
print("-- kernel='linear': classification Test Report --")
print(classification_report(y_test, clf.predict(X_test_processed_data)))

# Create an SVM classifier with an RBF kernel
svm = SVC(kernel='rbf', random_state=0, gamma=.01, C=10)
# Train the classifier
svm.fit(X_train_processed_data, y_train)

print("-- kernel='RBF': classification Train Report --")
print(classification_report(y_train, svm.predict(X_train_processed_data)))
print("-- kernel='RBF': classification Test Report --")
print(classification_report(y_test, svm.predict(X_test_processed_data)))


-- kernel='linear': classification Train Report --
              precision    recall  f1-score   support

           0       0.85      0.04      0.08       254
           1       0.74      1.00      0.85       703

    accuracy                           0.74       957
   macro avg       0.79      0.52      0.47       957
weighted avg       0.77      0.74      0.65       957

-- kernel='linear': classification Test Report --
              precision    recall  f1-score   support

           0       1.00      0.04      0.08        68
           1       0.73      1.00      0.84       172

    accuracy                           0.73       240
   macro avg       0.86      0.52      0.46       240
weighted avg       0.80      0.73      0.63       240

-- kernel='RBF': classification Train Report --
              precision    recall  f1-score   support

           0       0.85      0.09      0.16       254
           1       0.75      0.99      0.86       703

    accuracy                     

### Exercise 3.3 - SVM with Over-sampling (10 points)
 - For the column `satisfied` in our **training set**, please **print out** the frequency of each class. 
 - Oversample the **training data**. 
 - For the column `satisfied` in the oversampled data, **print out** the frequency of each class again.
 - Re-build the 2 SVMs with the same setting you have in Exercise 3.2, but **use oversampled training data** instead.
     - Do not forget to scale the data first. As always, the scaling method is up to you.
 - Report the **testing result** with `classification_report`.

You can use ANY of the methods listed [here](https://imbalanced-learn.org/stable/references/over_sampling.html#), such as RandomOverSampler or SMOTE. <br > 
You are also welcome to build your own oversampler. <br >

Note that you do not have to over-sample your testing data.
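
For those who prefer to build their own oversampler, a minimal NumPy-only sketch of random oversampling is shown below. The helper `random_oversample` and the toy arrays are hypothetical; this is not the imbalanced-learn implementation.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            pool = np.flatnonzero(y == cls)
            # sample with replacement from the rows of the under-represented class
            keep.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy training labels with a 6:2 imbalance (made-up values).
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0])
X_os, y_os = random_oversample(X_toy, y_toy)
print(np.bincount(y_toy), np.bincount(y_os))
```

Only the training split is passed to the helper, so the test data keeps its original class balance.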

In [24]:
# Remember to do this task with your processed data from Exercise 3.1
#For the column satisfied in our training set, please print out the frequency of each class.
print(y_train.value_counts())

1    703
0    254
Name: satisfied, dtype: int64


In [25]:
#Oversample the training data.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=21)
X_os_train, y_os = ros.fit_resample(X_train_processed_data, y_train)
print(y_os.value_counts())

0    703
1    703
Name: satisfied, dtype: int64


In [93]:
# Remember to continue the task with your processed data from Exercise 1
#Split the data into training and testing set with the ratio of 80:20 = already done 
print(X_os_train, X_os_train.shape)
    
print("--- Train ----")
display(X_os_train)
display(y_os)
print("--- Test ----")
display(X_test_processed_data)
display(y_test)

[[ 1.05128214  0.27511996  0.60125373 ...  0.          0.
   0.        ]
 [ 0.71156403  0.70466889  0.71038598 ...  0.          0.
   0.        ]
 [-0.96328278 -0.43208649 -1.10848494 ...  0.          0.
   0.        ]
 ...
 [-1.02355535 -0.43208649 -1.03573011 ...  0.          0.
   0.        ]
 [-1.1185303  -0.43208649 -0.96297527 ...  0.          0.
   0.        ]
 [-1.1185303  -0.43208649 -0.96297527 ...  0.          0.
   0.        ]] (1406, 33)
--- Train ----


array([[ 1.05128214,  0.27511996,  0.60125373, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.71156403,  0.70466889,  0.71038598, ...,  0.        ,
         0.        ,  0.        ],
       [-0.96328278, -0.43208649, -1.10848494, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.02355535, -0.43208649, -1.03573011, ...,  0.        ,
         0.        ,  0.        ],
       [-1.1185303 , -0.43208649, -0.96297527, ...,  0.        ,
         0.        ,  0.        ],
       [-1.1185303 , -0.43208649, -0.96297527, ...,  0.        ,
         0.        ,  0.        ]])

0       1
1       1
2       1
3       1
4       0
       ..
1401    0
1402    0
1403    0
1404    0
1405    0
Name: satisfied, Length: 1406, dtype: int32

--- Test ----


array([[ 2.53097099, -0.16122357,  1.56702345, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.71662743,  0.87812142,  0.66567643, ...,  1.        ,
         0.        ,  0.        ],
       [-1.08577362, -0.75277243, -0.83089976, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.97737233, -0.75277243,  0.29153238, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.02713281,  0.84377342,  0.68268297, ...,  0.        ,
         0.        ,  0.        ],
       [-1.08577362, -0.75277243, -0.558795  , ...,  0.        ,
         0.        ,  0.        ]])

230    0
550    1
824    1
394    1
213    0
      ..
724    1
324    0
392    0
68     1
366    0
Name: satisfied, Length: 240, dtype: int32

In [26]:
#Create a svm Classifier kernel='linear'
#clf = svm.SVC(kernel='linear') # Linear Kernel already done
#Train the model using the training sets
clf.fit(X_os_train, y_os)
print("-- kernel='linear': classification Train Report --")
print(classification_report(y_os, clf.predict(X_os_train)))
print("-- kernel='linear': classification Test Report --")
print(classification_report(y_test, clf.predict(X_test_processed_data)))

#Create a SVC classifier using an RBF kernel
#svm = svm.SVC(kernel='rbf', random_state=0, gamma=.01, C=10)  already done
# Train the classifier
svm.fit(X_os_train, y_os)

print("-- kernel='RBF': classification Train Report --")
print(classification_report(y_os, svm.predict(X_os_train)))
print("-- kernel='RBF': classification Test Report --")
print(classification_report(y_test, svm.predict(X_test_processed_data)))


-- kernel='linear': classification Train Report --
              precision    recall  f1-score   support

           0       0.68      0.78      0.73       703
           1       0.74      0.63      0.68       703

    accuracy                           0.71      1406
   macro avg       0.71      0.71      0.71      1406
weighted avg       0.71      0.71      0.71      1406

-- kernel='linear': classification Test Report --
              precision    recall  f1-score   support

           0       0.49      0.84      0.62        68
           1       0.91      0.65      0.76       172

    accuracy                           0.70       240
   macro avg       0.70      0.74      0.69       240
weighted avg       0.79      0.70      0.72       240

-- kernel='RBF': classification Train Report --
              precision    recall  f1-score   support

           0       0.69      0.86      0.77       703
           1       0.82      0.62      0.71       703

    accuracy                     