In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, f1_score

In [2]:
df = pd.read_csv("quality_assurance.csv")

In [3]:
df_rows_dropped = df.dropna()

In [4]:
X = df_rows_dropped.drop("flawed", axis=1)
y = df_rows_dropped["flawed"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
first_model = LogisticRegression(max_iter=10000)
first_model.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [8]:
f1_score(y_train, first_model.predict(X_train))

0.8153846153846154

In [10]:
second_model = LogisticRegression(max_iter=10000, C=0.1)
second_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=0.1, max_iter=10000)

In [11]:
f1_score(y_train, second_model.predict(X_train))

0.8062015503875968

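I chose the first model as the final model because its F1 score on the training set is higher (0.815 vs. 0.806). Note that the second model also raised a convergence warning, so its score may not reflect a fully fitted model.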
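Comparing the two candidates on training F1 alone can favor whichever model memorizes the training data best. One alternative, sketched below, is to compare them with cross-validated F1 instead. This is only an illustration and not what the cells above do: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification` (the real CSV is not reproduced here), and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for X_train / y_train; not the quality assurance data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# The same two settings that are compared in the notebook.
candidates = {
    "C=1.0 (default)": LogisticRegression(max_iter=10000),
    "C=0.1": LogisticRegression(max_iter=10000, C=0.1),
}

cv_f1 = {}
for name, model in candidates.items():
    # Each fold is scored on data the model was not fitted on.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    cv_f1[name] = scores.mean()
    print(name, round(cv_f1[name], 3))
```

With the real data, `X_train` and `y_train` would take the place of the synthetic arrays, which keeps the test set untouched until the final evaluation.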

In [12]:
final_model = LogisticRegression(max_iter=10000)
final_model.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [14]:
f1_score(y_test, final_model.predict(X_test))

0.6956521739130435

In [15]:
recall_score(y_test, final_model.predict(X_test))

0.64

The final model has an F1 score of roughly 0.70 on the test set, which suggests reasonable but not strong performance on precision and recall, with room for improvement. The training F1 score (about 0.82) is approximately 12 points higher than the test F1 score, which indicates that the model is overfitting the training data. The recall score of 0.64 means the model correctly identifies 64% of the flawed parts in the test set.
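Because F1 is the harmonic mean of precision and recall, the test precision can be recovered from the two scores reported above without rerunning the model. The short calculation below uses only the printed outputs; the variable names are illustrative.

```python
# Values copied from the notebook outputs above.
train_f1 = 0.8153846153846154
test_f1 = 0.6956521739130435
test_recall = 0.64

# F1 = 2 * P * R / (P + R)  =>  P = F1 * R / (2 * R - F1)
test_precision = test_f1 * test_recall / (2 * test_recall - test_f1)

# Gap between training and test F1, in absolute score.
f1_gap = train_f1 - test_f1

print(round(test_precision, 3))  # 0.762
print(round(f1_gap, 3))          # 0.12
```

So roughly 76% of the parts the model flags as flawed are actually flawed, while it misses about 36% of the flawed parts.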

In [12]:
final_model.coef_

array([[ 2.16394680e-02, -1.76052234e-01, -1.44673484e+00,
        -1.61736898e+00,  1.42540710e-03,  9.54852371e-05,
        -6.46155970e-01, -4.60842095e-01,  8.09379835e-02,
         5.40051238e-02,  1.67402054e-02,  5.00504002e-02,
        -9.82493234e-02,  6.20780558e-03,  3.93786987e-03,
         1.01746266e-02,  5.97070437e-03, -3.48426383e-03,
        -3.98919827e-03,  2.27766654e-03, -2.56863211e-02,
        -2.48962697e-03, -1.30645528e-02, -6.46668401e-04,
        -7.68377966e-04,  8.01408856e-05]])

The sixth and last coefficients are very small (on the order of 1e-4 or less), so those features might not be needed. However, the data has not been scaled, so a small coefficient could simply mean that the values in that column are large in magnitude. More investigation is needed.
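One way to carry out that investigation is to standardize the features before fitting, so that coefficient magnitudes become comparable across columns. The sketch below illustrates the effect on synthetic data, not on the quality assurance data: `X_demo` and `y_demo` are made up, with two equally informative features on very different scales. Scaling is also what the convergence warning earlier in the notebook suggests.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
small = rng.normal(0, 1, n)        # feature on a unit scale
large = rng.normal(0, 10000, n)    # equally informative, much larger scale
X_demo = np.column_stack([small, large])
# By construction, both features matter equally once put on a common scale.
y_demo = ((small + large / 10000 + rng.normal(0, 0.5, n)) > 0).astype(int)

# Unscaled fit: the large-scale feature typically gets a tiny coefficient.
raw_model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)
print("raw:   ", raw_model.coef_[0])

# Scaled fit: StandardScaler gives each column zero mean and unit variance.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
scaled_model.fit(X_demo, y_demo)
scaled_coefs = scaled_model.named_steps["logisticregression"].coef_[0]
print("scaled:", scaled_coefs)
```

If a coefficient stays near zero after scaling, that is stronger evidence that the feature contributes little. In the notebook, the same pipeline could be fitted on `X_train` and `y_train`.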