# Homework 3

Jiss Xavier, 916427256

In this assignment, we will be building a Naïve Bayes classifier and a SVM model for the productivity satisfaction of [the given dataset](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees), the productivity of garment employees.

## Background
The Garment Industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories. 

## Dataset Attribute Information

1. **date**: Date in MM-DD-YYYY
2. **day**: Day of the Week
3. **quarter** : A portion of the month. A month was divided into four quarters
4. **department** : Associated department with the instance
5. **team_no** : Associated team number with the instance
6. **no_of_workers** : Number of workers in each team
7. **no_of_style_change** : Number of changes in the style of a particular product
8. **targeted_productivity** : Targeted productivity set by the Authority for each team for each day.
9. **smv** : Standard Minute Value, it is the allocated time for a task
10. **wip** : Work in progress. Includes the number of unfinished items for products
11. **over_time** : Represents the amount of overtime by each team in minutes
12. **incentive** : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
13. **idle_time** : The amount of time when the production was interrupted due to several reasons
14. **idle_men** : The number of workers who were idle due to production interruption
15. **actual_productivity** : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

### Libraries that can be used: numpy, scipy, pandas, scikit-learn, cvxpy, imbalanced-learn
Any libraries used in the discussion materials are also allowed.

#### Other Notes

 - Don't worry about not being able to achieve high accuracy, it is neither the goal nor the grading standard of this assignment. <br >
 - If not specified, you are not required to do hyperparameter tuning, but feel free to do so if you'd like.

#### Trouble Shooting
In case you have trouble installing and using imbalanced-learn(imblearn) <br >
Run the below code cell, then go to the selection bar at top: Kernel > Restart. <br >
Then try `import imblearn` to see if things work. 


Worked on high level concepts with Tejes Srivastava

In [123]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install imbalanced-learn delayed



# Exercises

## Exercise 1 - General Data Preprocessing (20 points)

Our dataset needs cleaning before building any models. Some of the cleaning tasks are common in general, but depends on what kind of models we are building, sometimes we have to do additional processing. These additional tasks will be mentioned in each of the remaining two exercises later.

Note that **we will be using this processed data from exercise 1 in each of the remaining two exercises**.

For convenience, here are the attributes that we would treat them as **categorical attributes**: `day`, `quarter`, `department`, and `team`. 

 - Drop the column `date`.
 - For each of the categorical attributes, **print out** all the unique elements.
 - For each of the categorical attributes, remap the duplicated items, if you find there are typos or spaces among the duplicated items.
     - For example, "a" and "a " should be the same, so we need to update "a " to be "a".
     - Another example, "apple" and "appel" should be the same, so you should update "appel" to be "apple".
     

 - Create another column named `satisfied` that records the productivity performance. The behavior defined as follows. **This is the dependent variable we'd like to classify in this assignment.**
     - Return True or 1 if `actual_productivity` is equal to or greater than `targeted_productivity`. Otherwise, return False or 0, which means the team fails to meet the expected performance.
 - Drop the columns `actual_productivity` and `targeted_productivity`.


 - Find and **print out** which columns/attributes that have empty vaules, e.g., NA, NaN, null, None.
 - Fill the empty values with 0.


In [2]:
# If put the data(.csv) under the same folder, you could use
# df = pd.read_csv('./garments_worker_productivity.csv')
import pandas as pd
import numpy as np

#reads in the file using the command provided above
df = pd.read_csv('./garments_worker_productivity.csv')

#Drops the column date
df = df.drop(columns=['date'])

#prints all the unique items
print('Unique days: ', df.day.unique(), '\nUnique quarter attributes: ', df.quarter.unique(),
      '\nUnique department attributes: ', df.department.unique(), '\nUnique team attributes: ', df.team.unique())

#fixes the one typo found from the unique elements listed above
df.loc[:, "department"] = df.department.replace("finishing ", "finishing")
print('\nUnique department attributes (after fixing typos):',df.department.unique())

#adds satisfied column using if actual_productivity is equal to or greater than targeted_productivity
df['satisfied'] = (df['actual_productivity'] >= df['targeted_productivity'])

#drops the actual_productivity and targeted_productivity columns
df = df.drop(columns=['actual_productivity'])
df = df.drop(columns=['targeted_productivity'])

#finds columns with empty values
emptyValueColumns = df.columns[df.isna().any()].tolist()
print('\nColumns/Attributes with empty values:', emptyValueColumns)

#fills the empty values with 0's
df=df.fillna(0)

Unique days:  ['Thursday' 'Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday'] 
Unique quarter attributes:  ['Quarter1' 'Quarter2' 'Quarter3' 'Quarter4' 'Quarter5'] 
Unique department attributes:  ['sweing' 'finishing ' 'finishing'] 
Unique team attributes:  [ 8  1 11 12  6  7  2  3  9 10  5  4]

Unique department attributes (after fixing typos): ['sweing' 'finishing']

Columns/Attributes with empty values: ['wip']


## Exercise 2 - Naïve Bayes Classifier (40 points in total)

### Exercise 2.1 - Additional Data Preprocessing (10 points)

To build a Naïve Bayes Classifier, we need to further encode our categorical variables.

 - For each of the **categorical attribtues**, encode the set of categories to be **0 ~ (n_classes - 1)**.
     - For example, \["paris", "paris", "tokyo", "amsterdam"\] should be encoded as \[1, 1, 2, 0\].
     - Note that the order does not really matter, i.e., \[0, 0, 1, 2\] also works. But you have to start with 0 in your encodings.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.

In [3]:
#Note: The programs must be run in order to work

# Remember to continue the task with your processed data from Exercise 1
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

#Using Label Encoding Code from 10/27 Discussion
df.loc[:,'day'] = LabelEncoder().fit_transform(df['day'])
df.loc[:,'quarter'] = LabelEncoder().fit_transform(df['quarter'])
df.loc[:,'department'] = LabelEncoder().fit_transform(df['department'])
df.loc[:,'team'] = LabelEncoder().fit_transform(df['team'])

print('Results after being Label Encoded: \n\n', df)

#Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['satisfied']), df['satisfied'], test_size=0.2)

Results after being Label Encoded: 

       quarter  department  day  team    smv     wip  over_time  incentive  \
0           0           1    3     7  26.16  1108.0       7080         98   
1           0           0    3     0   3.94     0.0        960          0   
2           0           1    3    10  11.41   968.0       3660         50   
3           0           1    3    11  11.41   968.0       3660         50   
4           0           1    3     5  25.90  1170.0       1920         50   
...       ...         ...  ...   ...    ...     ...        ...        ...   
1192        1           0    5     9   2.90     0.0        960          0   
1193        1           0    5     7   3.90     0.0        960          0   
1194        1           0    5     6   3.90     0.0        960          0   
1195        1           0    5     8   2.90     0.0       1800          0   
1196        1           0    5     5   2.90     0.0        720          0   

      idle_time  idle_men  no_of_styl

### Exercise 2.2 - Naïve Bayes Classifier for Categorical Attributes (15 points)

Use the categorical attributes **only**, please build a Categorical Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

In [4]:
#Note: The programs must be run in order to work
#Note: Exercise 2.1 must be run before this code can work

# Remember to do this task with your processed data from Exercise 2.1
import pandas as pd
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import classification_report

#set the data frame for training and testing
train_df = pd.DataFrame()
test_df = pd.DataFrame()

#fill the data frame columns with the training and testing data
train_df['day'] = X_train['day']
test_df['day'] = X_test['day']

train_df['quarter'] = X_train['quarter']
test_df['quarter'] = X_test['quarter']

train_df['department'] = X_train['department']
test_df['department'] = X_test['department']

train_df['team'] = X_train['team']
test_df['team'] = X_test['team']

#from 10/27 Discussion Code: CategoricalNB
NB = CategoricalNB()
NB.fit(train_df, y_train)

#Reporting the testing results using classification_report
#print('Training results: \n', classification_report(y_train, NB.predict(train_df)))
print('Testing results: \n', classification_report(y_test, NB.predict(test_df)))

Testing results: 
               precision    recall  f1-score   support

       False       0.65      0.19      0.29        70
        True       0.74      0.96      0.84       170

    accuracy                           0.73       240
   macro avg       0.70      0.57      0.56       240
weighted avg       0.71      0.73      0.68       240



### Exercise 2.3 - Naïve Bayes Classifier for Numerical Attributes (15 points)

Use the numerical attributes **only**, please build a Gaussian Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

**Remember to scale your data. The scaling method is up to you.**

In [5]:
#Note: The programs must be run in order to work
#Note: Exercise 2.1 must be run before this code can work

# Remember to do this task with your processed data from Exercise 2.1
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

#Remove the non numerical columns
X_train.drop(columns=['quarter', 'day', 'department', 'team'])
X_test.drop(columns=['quarter', 'day', 'department', 'team'])

#from 10/27 Discussion Code: Gaussian Naïve Bayes
scaler = StandardScaler()
NB = GaussianNB()
scaler.fit(X_train)
NB.fit(scaler.transform(X_train), np.asarray(y_train))

#Reporting the testing results using classification_report
#print('Training results: \n', classification_report(y_train, NB.predict(scaler.transform(X_train))))
print('Testing results: \n', classification_report(y_test, NB.predict(scaler.transform(X_test))))

Testing results: 
               precision    recall  f1-score   support

       False       0.75      0.04      0.08        70
        True       0.72      0.99      0.83       170

    accuracy                           0.72       240
   macro avg       0.73      0.52      0.46       240
weighted avg       0.73      0.72      0.61       240



## Exercies 3 - SVM Classifier (40 points in total)

### Exercise 3.1 - Additional Data Preprocessing (10 points)

To build a SVM Classifier, we need a different encoding for our categorical variables.

 - For each of the **categorical attribtues**, encode them with **one-hot encoding**.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.


In [6]:
# Remember to continue the task with your processed data from Exercise 1

#one-hot encoding using get_dummies (Piazza #249)
#https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
df = pd.get_dummies(df, columns=['quarter', 'day', 'department', 'team'])

#Split the data into training and testing (Same as 2.1)
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['satisfied']), df['satisfied'], test_size=0.2)

print('Results after being One-Hot Encoded: \n\n',df)

Results after being One-Hot Encoded: 

         smv     wip  over_time  incentive  idle_time  idle_men  \
0     26.16  1108.0       7080         98        0.0         0   
1      3.94     0.0        960          0        0.0         0   
2     11.41   968.0       3660         50        0.0         0   
3     11.41   968.0       3660         50        0.0         0   
4     25.90  1170.0       1920         50        0.0         0   
...     ...     ...        ...        ...        ...       ...   
1192   2.90     0.0        960          0        0.0         0   
1193   3.90     0.0        960          0        0.0         0   
1194   3.90     0.0        960          0        0.0         0   
1195   2.90     0.0       1800          0        0.0         0   
1196   2.90     0.0        720          0        0.0         0   

      no_of_style_change  no_of_workers  satisfied  quarter_0  ...  team_2  \
0                      0           59.0       True          1  ...       0   
1          

### Exercise 3.2 - SVM with Different Kernels (20 points)

Using all the attributes we have, please build a SVM that predicts the column `satisfied`. <br >
Specifically, please 
 - Build one SVM with **linear kernel**.
 - Build another SVM but with **rbf kernel**.
 - Report the **testing results** of **both models** using `classification report`.

The kernel is the only setting requirement. <br >
Other hyperparameter tuning is not required. But make sure they are the same in these two SVMs if you'd like to tune the model. In other words, the only difference between the two SVMs should be the kernel setting.

**Remember to scale your data. The scaling method is up to you.**

In [7]:
# Remember to do this task with your processed data from Exercise 3.1

#Adapted from 11/03 Discussion Code
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

clf = SVC(kernel='rbf')
clf_li = LinearSVC(dual=False)

#scaler from Exercise 2.3
scaler.fit(X_train)

clf.fit(scaler.transform(X_train), np.asarray(y_train))
clf_li.fit(scaler.transform(X_train), np.asarray(y_train))

print('RBF Kernel Testing results: \n',classification_report(y_test, clf.predict(scaler.transform(X_test))), '\n')
print('Linear Kernel Testing results: \n',classification_report(y_test, clf_li.predict(scaler.transform(X_test))), '\n')

RBF Kernel Testing results: 
               precision    recall  f1-score   support

       False       0.60      0.27      0.37        66
        True       0.77      0.93      0.84       174

    accuracy                           0.75       240
   macro avg       0.69      0.60      0.61       240
weighted avg       0.72      0.75      0.71       240
 

Linear Kernel Testing results: 
               precision    recall  f1-score   support

       False       0.61      0.29      0.39        66
        True       0.78      0.93      0.85       174

    accuracy                           0.75       240
   macro avg       0.69      0.61      0.62       240
weighted avg       0.73      0.75      0.72       240
 



### Exercise 3.3 - SVM with Over-sampling (10 points)
 - For the column `satisfied` in our **training set**, please **print out** the frequency of each class. 
 - Oversample the **training data**. 
 - For the column `satisfied` in the oversampled data, **print out** the frequency of each class  again.
 - Re-build the 2 SVMs with the same setting you have in Exercise 3.2, but **use oversampled training data** instead.
     - Do not forget to scale the data first. As always, the scaling method is up to you.
 - Report the **testing result** with `classification_report`.

You can use ANY methods listed on [here](https://imbalanced-learn.org/stable/references/over_sampling.html#) such as RandomOverSampler or SMOTE. <br > 
You are definitely welcomed to build your own oversampler. <br >

Note that you do not have to over-sample your testing data.

In [8]:
# Remember to do this task with your processed data from Exercise 3.1
from imblearn.over_sampling import SMOTE

#finding the frequency before oversampling
frequeny_preOverSampling = y_train.value_counts()
print("Frequency before oversampling:\n")
print(frequeny_preOverSampling,'\n')

#https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
smote = SMOTE(random_state=12)
x_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)

#same as Exercise 3.2
scaler.fit(x_train_oversampled)
clf.fit(scaler.transform(x_train_oversampled), np.asarray(y_train_oversampled))
clf_li.fit(scaler.transform(x_train_oversampled), np.asarray(y_train_oversampled))

#finding the frequency before oversampling
frequeny_postOverSampling = y_train_oversampled.value_counts()
print("Frequency after oversampling:\n")
print(frequeny_postOverSampling,'\n')

print('RBF Kernel Testing results: \n',classification_report(y_test, clf.predict(scaler.transform(X_test))), '\n')
print('Linear Kernel Testing results: \n',classification_report(y_test, clf_li.predict(scaler.transform(X_test))), '\n')

Frequency before oversampling:

True     701
False    256
Name: satisfied, dtype: int64 

Frequency after oversampling:

False    701
True     701
Name: satisfied, dtype: int64 

RBF Kernel Testing results: 
               precision    recall  f1-score   support

       False       0.59      0.35      0.44        66
        True       0.79      0.91      0.84       174

    accuracy                           0.75       240
   macro avg       0.69      0.63      0.64       240
weighted avg       0.73      0.75      0.73       240
 

Linear Kernel Testing results: 
               precision    recall  f1-score   support

       False       0.59      0.30      0.40        66
        True       0.78      0.92      0.84       174

    accuracy                           0.75       240
   macro avg       0.68      0.61      0.62       240
weighted avg       0.72      0.75      0.72       240
 

