Data library

### Column Name: Menstrual_Flow  (ID = Q4)
- **Description**: Describes the volume of menstrual flow
- **Data Type**: Categorical
- **Permissible Values**: Light, Moderate, Heavy, Very Heavy
- **Notes**: Derived from questionnaire

### Column Name: Menstrual_Frequency  (ID = Q5)
- **Description**: Frequency of menstrual periods
- **Data Type**: Categorical
- **Permissible Values**: Regularly, Infrequently, Absence
- **Notes**: Derived from questionnaire

### Column Name: Pain_During_Menstruation  (ID = Q6)
- **Description**: Presence and intensity of pain during menstruation
- **Data Type**: Categorical
- **Permissible Values**: No, Mild cramps, Painful cramps
- **Notes**: Derived from questionnaire

### Column Name: Irregular_Bleeding  (ID = Q7)
- **Description**: Occurrence of bleeding at irregular intervals between periods
- **Data Type**: Categorical
- **Permissible Values**: No, Yes
- **Notes**: Derived from questionnaire

### Column Name: Premenstrual_Symptoms  (ID = Q8)
- **Description**: Presence and severity of premenstrual symptoms
- **Data Type**: Categorical
- **Permissible Values**: No, Mild symptoms, Severe symptoms
- **Notes**: Derived from questionnaire

### Column Name: Menstrual_Period_Duration  (ID = Q9)
- **Description**: Duration of menstrual periods
- **Data Type**: Categorical
- **Permissible Values**: Less than 2 days, 3-5 days, More than 7 days
- **Notes**: Derived from questionnaire

### Column Name: Recent_Changes  (ID = Q10)
- **Description**: Recent changes in menstrual cycle
- **Data Type**: Categorical
- **Permissible Values**: No, Yes
- **Notes**: Derived from questionnaire

### Column Name: Condition_Met
- **Description**: Indicator if a specific condition is met within the study
- **Data Type**: Boolean
- **Permissible Values**: True, False
- **Notes**: Used as the training label; stored as a boolean value
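
The permissible values listed above can be turned into a simple validation check. The following is a minimal sketch, not part of the original notebook: it assumes the raw questionnaire answers are stored as the text labels from the data library (the CSV files loaded later hold integer codes whose mapping is not documented here), and the names `PERMISSIBLE` and `find_invalid`, as well as the sample answers, are hypothetical.

```python
import pandas as pd

# Permissible values copied from the data library above, keyed by question ID.
PERMISSIBLE = {
    "Q4": ["Light", "Moderate", "Heavy", "Very Heavy"],
    "Q5": ["Regularly", "Infrequently", "Absence"],
    "Q6": ["No", "Mild cramps", "Painful cramps"],
    "Q7": ["No", "Yes"],
    "Q8": ["No", "Mild symptoms", "Severe symptoms"],
    "Q9": ["Less than 2 days", "3-5 days", "More than 7 days"],
    "Q10": ["No", "Yes"],
}


def find_invalid(df, permissible=PERMISSIBLE):
    """Return {column: sorted list of values outside the permissible set}."""
    invalid = {}
    for col, allowed in permissible.items():
        if col not in df.columns:
            continue  # column absent from this frame, nothing to check
        bad = set(df[col].dropna().unique()) - set(allowed)
        if bad:
            invalid[col] = sorted(bad)
    return invalid


# Hypothetical raw answers, for illustration only.
raw = pd.DataFrame({
    "Q4": ["Light", "Heavy", "Extreme"],
    "Q7": ["No", "Yes", "No"],
})
print(find_invalid(raw))  # {'Q4': ['Extreme']}
```

Running such a check before encoding the answers as integers would surface typos or unexpected categories early.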


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from joblib import dump
import os

Paths

In [2]:
# Define the data directory and model file path
# (named data_dir to avoid shadowing the built-in dir())
data_dir = './'
model_path = os.path.join(data_dir, 'svm_model.joblib')

# Load training and testing data from .csv files
train = pd.read_csv(os.path.join(data_dir, 'SVM_train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'SVM_test.csv'))

Sneak peek at the dataset

In [3]:
test.sample(5)

     Q4  Q5  Q6  Q7  Q8  Q9  Q10  Anomaly
658   2   1   1   1   1   2    1     True
142   3   2   0   0   1   3    1    False
156   3   2   0   0   1   3    1    False
27    3   2   0   0   1   3    1    False
693   2   1   1   1   1   2    1     True


Reproducibility

In [8]:
random_configed = 42
np.random.seed(random_configed)
!pip freeze > ../requirements-frozen.txt

In [5]:
# Separate features and target
X_train = train.iloc[:, :-1]  # All columns except the last one
y_train = train.iloc[:, -1]  # Only the last column
X_test = test.iloc[:, :-1]  # All columns except the last one
y_test = test.iloc[:, -1]  # Only the last column

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the SVM model
model = SVC(kernel='linear', random_state=random_configed)
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Print the classification report and confusion matrix
report = classification_report(y_test, y_pred, output_dict=True)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Save the updated model
dump(model, model_path)

# Print the summarized metrics from the classification report
print(f"Precision: {report['macro avg']['precision']:.2f}")
print(f"Recall: {report['macro avg']['recall']:.2f}")
print(f"F1 Score: {report['macro avg']['f1-score']:.2f}")
print(f"Accuracy: {report['accuracy']:.2f}")

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       500
        True       1.00      1.00      1.00       500

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000

[[500   0]
 [  0 500]]
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Accuracy: 1.00
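
Note that the cell above persists only the fitted `SVC`; the `StandardScaler` fitted on the training data is not saved, so the saved file alone cannot reproduce the same scaling on new data. One common option is to bundle both steps in a scikit-learn `Pipeline` and dump that instead. The following is a minimal sketch under stated assumptions, not the notebook's actual method: the toy frame only mirrors the two row patterns visible in the printed outputs, and the file name `svm_pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data mirroring the two row patterns shown in the notebook output
# (an assumption for illustration; the real CSV files are not read here).
cols = ["Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10"]
X_toy = pd.DataFrame(
    [[2, 1, 1, 1, 1, 2, 1], [3, 2, 0, 0, 1, 3, 1]] * 4, columns=cols
)
y_toy = pd.Series([True, False] * 4, name="Anomaly")

# Scaler and classifier are fitted together and travel as one object.
pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", random_state=42),
)
pipeline.fit(X_toy, y_toy)

# Persist and reload the whole pipeline (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "svm_pipeline.joblib")
dump(pipeline, path)
restored = load(path)

# The restored pipeline scales raw features itself before predicting.
preds = restored.predict(X_toy)
print(preds)
```

With this layout, inference code only needs to call `predict` on unscaled features, which removes the risk of applying a different scaler at prediction time.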


In [6]:
# Overview of the data
print(train.head(10))
print(test.head(10))

# Print the counts of 'True' and 'False' in the training data
print(y_train.value_counts())
# Print the counts of 'True' and 'False' in the test data
print(y_test.value_counts())

   Q4  Q5  Q6  Q7  Q8  Q9  Q10  Anomaly
0   2   1   1   1   1   2    1     True
1   2   1   1   1   1   2    1     True
2   2   1   1   1   1   2    1     True
3   3   2   0   0   1   3    1    False
4   3   2   0   0   1   3    1    False
5   2   1   1   1   1   2    1     True
6   3   2   0   0   1   3    1    False
7   3   2   0   0   1   3    1    False
8   3   2   0   0   1   3    1    False
9   3   2   0   0   1   3    1    False
   Q4  Q5  Q6  Q7  Q8  Q9  Q10  Anomaly
0   3   2   0   0   1   3    1    False
1   2   1   1   1   1   2    1     True
2   3   2   0   0   1   3    1    False
3   2   1   1   1   1   2    1     True
4   2   1   1   1   1   2    1     True
5   2   1   1   1   1   2    1     True
6   2   1   1   1   1   2    1     True
7   2   1   1   1   1   2    1     True
8   3   2   0   0   1   3    1    False
9   3   2   0   0   1   3    1    False
Anomaly
True     2000
False    2000
Name: count, dtype: int64
Anomaly
False    500
True     500
Name: count, dtype: int64
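
In the rows printed above, only two distinct feature patterns appear, each tied to a single label. If that holds for the full files (which is not shown here), the perfect scores would reflect duplicated rows shared between the training and test sets rather than generalization. The following is a minimal sketch of how one might check this; `summarize_overlap` is a hypothetical helper, and the toy frames simply reuse the two patterns from the output instead of reading the real CSV files.

```python
import pandas as pd

cols = ["Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10", "Anomaly"]
pos = [2, 1, 1, 1, 1, 2, 1, True]   # pattern printed with Anomaly == True
neg = [3, 2, 0, 0, 1, 3, 1, False]  # pattern printed with Anomaly == False

# Toy stand-ins for the real train/test frames (assumption for illustration).
train_toy = pd.DataFrame([pos, pos, pos, neg, neg], columns=cols)
test_toy = pd.DataFrame([neg, pos, neg], columns=cols)


def summarize_overlap(train_df, test_df):
    """Return (distinct training rows, share of test rows also in training)."""
    unique_train = train_df.drop_duplicates()
    # A left merge on all shared columns marks each test row as
    # 'both' (also present in training) or 'left_only' (test only).
    merged = test_df.merge(unique_train, how="left", indicator=True)
    share_seen = float((merged["_merge"] == "both").mean())
    return len(unique_train), share_seen


n_unique, share_seen = summarize_overlap(train_toy, test_toy)
print(f"Distinct training rows: {n_unique}")
print(f"Share of test rows already seen in training: {share_seen:.2f}")
```

On the real data, the same call would be `summarize_overlap(train, test)`; a share close to 1.00 would suggest the reported metrics say little about performance on unseen answer patterns.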