# Various models were tested prior to selecting the one for inclusion in the final code

### The training data was read into the script for use with all the potential training models.  Basic data manipulation was completed.

In [1]:
# import initial needed libraries
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
#read in training data
file = 'winequality_training_data2.csv'
wine_df = pd.read_csv(file)

In [4]:
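#cleanup data: drop any columns that are entirely null first,
#so that an empty column does not cause every row to be dropped
wine_df = wine_df.dropna(axis='columns', how='all')

#then drop any rows that still contain null values
wine_df = wine_df.dropna()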

In [5]:
#additional dataframe manipulation to one-hot encode the string wine colors as indicator columns
wine_df = pd.get_dummies(wine_df, columns=['color'])

## Tested using RandomOverSampler with the LogisticRegression model.  

In [11]:
#import needed libraries
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#split up the data
X = wine_df.copy()
X = X.drop(columns='quality')
y = wine_df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

ros = RandomOverSampler()
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

Counter(y_resampled)

Counter({8: 15, 7: 15, 3: 15, 6: 15, 5: 15, 4: 15, 9: 15})
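The equal counts in the output above follow from how random oversampling works: rows from the smaller classes are duplicated at random until every class matches the largest one. The following is a minimal standard-library sketch of that idea, not the imblearn implementation; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical toy data: three quality classes with 5, 2 and 1 rows.
rows = [[i] for i in range(8)]
labels = [5, 5, 5, 5, 5, 6, 6, 7]
res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

As in the cell above, resampling should be applied to the training split only, so that duplicated rows never leak into the test set.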

In [12]:
#build model
model = LogisticRegression(solver='lbfgs')

model.fit(X_resampled, y_resampled)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

#### The warning indicates that the lbfgs solver reached its iteration limit before converging, which the message attributes to unscaled features or too low a max_iter value rather than to the size of the dataset.  Continued with testing the model to verify its appropriateness or lack thereof.
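The warning text suggests either scaling the data or raising `max_iter`. The sketch below shows what standardization does to a feature column, using only the standard library; the helper `standardize` and the sample values are hypothetical and are not taken from the training file.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no signal
    return [(value - mu) / sigma for value in column]

# Hypothetical values on very different scales, as wine features often are.
total_sulfur_dioxide = [34.0, 67.0, 54.0, 170.0, 132.0]
density = [0.9978, 0.9968, 0.9970, 1.0010, 0.9940]

scaled_so2 = standardize(total_sulfur_dioxide)
scaled_density = standardize(density)
print([round(v, 2) for v in scaled_so2])
print([round(v, 2) for v in scaled_density])
```

In scikit-learn, `StandardScaler` applies this kind of scaling, and `LogisticRegression` accepts a `max_iter` argument (default 100). Whether either change would have improved the scores reported here was not tested.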

In [14]:
#create predictions
y_pred = model.predict(X_test)

In [15]:
#use balanced accuracy score to assess accuracy
from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(y_test, y_pred)

0.2571428571428572

#### The accuracy score provided was far below acceptable levels for this project.  Continued with further testing of other models.
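For context, balanced accuracy is the average of the per-class recall values. With seven quality classes (3 through 9 in the Counter output), a model that always predicts a single class would score about 1/7 ≈ 0.143, assuming all seven classes appear in the test split. A minimal sketch of the calculation follows; the function `balanced_accuracy` and the labels are hypothetical.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Average the recall of each class that appears in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)

# Hypothetical labels: the model always predicts the most common class.
y_true = [5, 5, 5, 5, 6, 6, 7]
y_pred = [5, 5, 5, 5, 5, 5, 5]
print(balanced_accuracy(y_true, y_pred))
```

Plain accuracy on this toy example would be 4/7, while balanced accuracy is 1/3. This is why the metric is a stricter test on imbalanced classes.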

## Testing LogisticRegression without utilizing RandomOverSampling.  

In [19]:
#create model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

#### Received the same convergence warning again.  Again decided to continue with testing the model to verify appropriateness.

In [20]:
#create predictions
y_pred = model.predict(X_test)

In [21]:
balanced_accuracy_score(y_test, y_pred)

0.17142857142857143

#### The resulting balanced accuracy score was lower than with random oversampling, thereby ruling out the potential use of this model.

In [22]:
#using an alternate solver (liblinear) to see if accuracy was impacted
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

balanced_accuracy_score(y_test, y_pred)

0.19999999999999998

#### Accuracy improved slightly over the lbfgs run without oversampling (0.20 versus 0.17), but not to a level that provided acceptable results.

## Attempted using three different classifier models: Balanced Random Forest Classifier, Easy Ensemble Classifier, and Random Forest Classifier

In [24]:
#import libraries
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import EasyEnsembleClassifier

rf_model = BalancedRandomForestClassifier()
rf_model = rf_model.fit(X_train, y_train)

predictions = rf_model.predict(X_test)
balanced_accuracy_score(y_test, predictions)

0.5142857142857143

In [25]:
rf_model = EasyEnsembleClassifier()
rf_model = rf_model.fit(X_train, y_train)

predictions = rf_model.predict(X_test)
balanced_accuracy_score(y_test, predictions)

0.34285714285714286

In [26]:
#randomforestclassifier
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model = rf_model.fit(X_train, y_train)

predictions = rf_model.predict(X_test)
balanced_accuracy_score(y_test, predictions)

0.4857142857142857

In [27]:
#rank the features by their importance in the fitted random forest
importances = rf_model.feature_importances_
sorted(zip(importances, X.columns), reverse=True)

[(0.13028499228574295, 'alcohol'),
 (0.11330455847690797, 'volatile acidity'),
 (0.097964373549354, 'sulphates'),
 (0.09171173847840954, 'fixed acidity'),
 (0.08576428758067721, 'density'),
 (0.08273137785403284, 'citric acid'),
 (0.08264824067394522, 'total sulfur dioxide'),
 (0.08194449524933392, 'chlorides'),
 (0.07958165705862494, 'pH'),
 (0.07300951037238344, 'free sulfur dioxide'),
 (0.06829818960975703, 'residual sugar'),
 (0.007507464123517582, 'color_red'),
 (0.005249114687313327, 'color_white')]

#### Upon first running these models, the improved scores seemed very promising.  However, after further consideration, it was determined that these models were not the correct ones to use, because we were trying to predict quality scores for wines, which were treated as a continuous number rather than a categorical label.  The resulting values were misleading and not applicable to our goal, so these models were not considered further.  The weighted importance of the features did provide some useful insight, however.

## After conferring with group members and others, it was determined that the problem was more of a regression concern.  Attempted to solve the concern with a multiple linear regression model.

In [29]:
#import additional library
from sklearn.linear_model import LinearRegression

#set up features to be used in model
X = wine_df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'color_red', 'color_white']]
y = wine_df['quality']

#build linear regression model
model = LinearRegression()

#fit the training data to model
model.fit(X, y)

LinearRegression()

In [30]:
#calculate the R2 score (computed on the same data the model was fit to)
score = model.score(X, y)
print(f'R2 Score: {score}')

R2 Score: 0.5612103200433789


#### Typically an R2 score of 0.70 or higher is desirable for an "acceptable" level of accuracy.  Although the score of 0.56 falls short of that, this model was still determined to be the best available option for predicting the quality scores of our wines.  Separate sample sets were taken from the training set and run against the model to assess its predictions, and the results were judged reasonably close: most predictions differed from the actual quality scores by about 0.3 to 0.6 on the tested sets.
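For reference, R2 is reported on a scale where 1.0 is a perfect fit and 0.0 matches always predicting the mean. The sketch below computes R2 and the mean absolute error by hand, as one possible way to summarize how far predictions fall from the actual scores. The helper names and the sample scores are hypothetical and are not the values from the tested sets.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical quality scores and model predictions.
actual = [5, 6, 7, 5, 6, 8]
predicted = [5.4, 5.7, 6.5, 5.2, 6.3, 7.1]

print(f'R2 Score: {r2_score(actual, predicted)}')
print(f'Mean absolute error: {mean_absolute_error(actual, predicted)}')
```

Because the samples described above were drawn from the training set, scoring on rows held out before fitting (for example with `train_test_split`) would give a less optimistic estimate.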