# Data Slicing Performance

Exercise repository: [Performance_testing_FinalExercise](https://github.com/mxagar/mlops-udacity-deployment-demos/tree/main/Performance_testing_FinalExercise).

This exercise uses the [Raisin dataset](https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset). It contains 900 observations of 7 raisin features (all numerical); each observation has a target class that specifies one of two raisin varieties: 'Kecimen' or 'Besni'.

Data slicing should be applied manually to the dataset as follows:

- The data is split into `train` and `validation` sets.
- A logistic regression model is fit and its F1 score is computed: this is the overall score.
- 3 features are chosen and their mean is computed; then, 2 buckets or ranges are defined for each of the 3 features: above and below the mean. Thus, we get 3x2 = 6 slices.
- For each slice, the F1 metric is computed again and compared to the overall F1.
- Finally, a model card is written.
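
The slicing steps above can be sketched as a small helper. This is a minimal illustration, not the solution notebook's code: the function name `f1_by_mean_slices`, the bucket labels, and the toy values are assumptions, and the mean is computed on whatever frame is passed in (the exercise may instead use the training-set means).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score


def f1_by_mean_slices(X, y_true, y_pred, features):
    """Return one row per (feature, bucket) slice with its size and F1."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    rows = []
    for feature in features:
        # Two buckets per feature: strictly above the mean, and the rest.
        threshold = X[feature].mean()
        above = (X[feature] > threshold).to_numpy()
        for bucket, mask in (("above_mean", above), ("below_or_equal_mean", ~above)):
            rows.append({
                "feature": feature,
                "bucket": bucket,
                "n": int(mask.sum()),
                # zero_division=0 avoids a warning if a slice has no positives.
                "f1": f1_score(y_true[mask], y_pred[mask], zero_division=0),
            })
    return pd.DataFrame(rows)


# Tiny hand-made example (hypothetical values, not taken from the Raisin data).
X_demo = pd.DataFrame({"Area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y_true_demo = [0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1, 0]
slice_report = f1_by_mean_slices(X_demo, y_true_demo, y_pred_demo, ["Area"])
print(slice_report)
```

Comparing each row's `f1` with the overall F1 shows where the model underperforms. Small slices give noisy scores, which is why `n` is reported alongside the metric.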

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report, f1_score
```

```python
data = pd.read_csv("./exercise_data/Raisin_Dataset.csv")
y = data.pop("Class")

# Split the data into train and validation, stratifying on the target feature.
X_train, X_val, y_train, y_val = train_test_split(data, y, stratify=y, random_state=23)
```
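
Since no `test_size` is passed, `train_test_split` falls back to its default of 0.25, so 675 of the 900 rows end up in the training set. A quick check that stratification keeps the class proportions similar in both splits might look like the following; the synthetic frame and the 450/450 class balance are assumptions standing in for the CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Raisin data: 900 rows, two balanced classes.
demo = pd.DataFrame({"feature": range(900)})
labels = pd.Series(["Kecimen"] * 450 + ["Besni"] * 450, name="Class")

X_tr, X_va, y_tr, y_va = train_test_split(
    demo, labels, stratify=labels, random_state=23
)

# Split sizes follow the default test_size=0.25.
print(len(X_tr), len(X_va))

# With stratify=labels, both splits keep roughly the original class shares.
print(y_tr.value_counts(normalize=True))
print(y_va.value_counts(normalize=True))
```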

```python
# Get a high level overview of the data. This will be useful for slicing.
X_train.describe()
```

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|---|
| count | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 |
| mean | 87210.494815 | 427.650555 | 254.414345 | 0.779895 | 90407.262222 | 0.701092 | 1159.625772 |
| std | 38388.571707 | 110.506268 | 49.752074 | 0.088938 | 39602.352484 | 0.050807 | 261.820857 |
| min | 25387.0 | 225.629541 | 144.618672 | 0.34873 | 26139.0 | 0.454189 | 619.074 |
| 25% | 59032.5 | 343.732369 | 218.692197 | 0.740516 | 61466.5 | 0.671134 | 964.8355 |
| 50% | 79057.0 | 405.936594 | 247.352044 | 0.797864 | 81779.0 | 0.709949 | 1117.107 |
| 75% | 103790.5 | 493.185891 | 280.180509 | 0.840452 | 108022.5 | 0.735886 | 1302.4165 |
| max | 235047.0 | 843.956653 | 492.275279 | 0.92377 | 239093.0 | 0.830632 | 2253.557 |


```python
data.head()
```

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|---|
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 |


```python
lr = LogisticRegression(max_iter=1000, random_state=23)
lb = LabelBinarizer()

# Binarize the target feature.
y_train = lb.fit_transform(y_train)
y_val = lb.transform(y_val)

# Train Logistic Regression.
lr.fit(X_train, y_train.ravel())
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=23, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
```
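
Once the model is fit, the overall F1 is the baseline that every slice is compared against. The following is a minimal sketch of that step on synthetic, well-separated data; the arrays and the `*_demo` names are hypothetical stand-ins for the raisin features, not values from the exercise. `LabelBinarizer` sorts the class labels, so 'Kecimen' maps to 1 and is the positive class that `f1_score` reports by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

rng = np.random.default_rng(23)
# Hypothetical two-class data: two Gaussian blobs, 100 points each.
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
y_demo = np.array(["Besni"] * 100 + ["Kecimen"] * 100)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=23
)

# Fit the binarizer on the training labels only, then reuse it for validation.
lb_demo = LabelBinarizer()
y_tr_bin = lb_demo.fit_transform(y_tr).ravel()
y_va_bin = lb_demo.transform(y_va).ravel()

model = LogisticRegression(max_iter=1000, random_state=23).fit(X_tr, y_tr_bin)

# Overall F1 on the validation split: the reference value for slice scores.
overall_f1 = f1_score(y_va_bin, model.predict(X_va))
print(f"Overall validation F1: {overall_f1:.3f}")
```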

### See the Solution Notebook!