## Preparing the Dataset
The dataset we are using is **OtomeGames.csv**, which we load into Python with the two code cells below.

In [1]:
import csv
def prepare_datasets(file_path):
    """ 
    Accepts: path to a comma-separated plaintext file
    Returns: a list containing a dictionary for every row in the file, 
        with the file column headers as keys
    """
    
    # newline='' is what the csv module docs recommend when opening files for csv readers
    with open(file_path, newline='') as infile:
        reader = csv.DictReader(infile, delimiter=',')
        list_of_dicts = [dict(r) for r in reader]
        
    return list_of_dicts

In [2]:
otome_games = prepare_datasets("csvfiles/OtomeGames.csv")
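
To see what `prepare_datasets` hands back, here is a minimal sketch that runs the same `csv.DictReader` call on a small in-memory sample instead of the real file. The `Title` column and both rows are made up for illustration; only `NoLI` and `Copies1stWeek` are column names taken from this notebook.

```python
import csv
import io

# Hypothetical two-row sample; the real OtomeGames.csv has more columns and rows.
sample_csv = "Title,NoLI,Copies1stWeek\nGame A,5,12000\nGame B,7,\n"

reader = csv.DictReader(io.StringIO(sample_csv), delimiter=',')
sample_rows = [dict(r) for r in reader]

# Every value comes back as a string, and an empty cell becomes ''.
print(sample_rows[0])
print(repr(sample_rows[1]["Copies1stWeek"]))
```

This is why the next step has to deal with empty strings and convert the numeric columns by hand.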

Below, we replace the empty cells in the dataset with NaN values. Then we convert the relevant numerical columns from strings to floats.

In [3]:
import pandas as pd
import numpy as np
games_df = pd.DataFrame(otome_games)
games_df.replace('', np.nan, inplace=True)
numeric_cols = ['NoLI', 'NoFemale', 'NoFemaleLI', 'NoFemaleFI', 'NoLGBT',
                'Copies1stWeek', 'CopiesTotal']
games_df[numeric_cols] = games_df[numeric_cols].astype(float)
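
As a sketch of the same clean-up on toy data: `pd.to_numeric` with `errors="coerce"` is an alternative to the replace-then-`astype` approach (it is not what the cell above uses). It turns empty strings into NaN and converts to numbers in one step. The rows below are made up.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
toy_df = pd.DataFrame({"NoLI": ["5", "7", ""], "CopiesTotal": ["12000", "", "800"]})

toy_numeric_cols = ["NoLI", "CopiesTotal"]
# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy_df[toy_numeric_cols] = toy_df[toy_numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy_df)
```

One trade-off to keep in mind: coercion also silently turns typos such as `"12k"` into NaN, whereas `astype(float)` would raise an error on them.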

## Separating First Week Sales and Total Copies Sold
Because we are trying to optimize both Copies1stWeek and CopiesTotal, we create two separate pandas DataFrame objects. Both have the same independent variables, but each has a different dependent variable.

In [4]:
rel_games_df = games_df[['Year', 'Copies1stWeek', 'CopiesTotal', 'NoLI', 'NoFemale', 'NoFemaleLI', 'NoFemaleFI', 'NoLGBT']]

firstweek_df = rel_games_df.drop("CopiesTotal", axis=1)
totalsales_df = rel_games_df.drop("Copies1stWeek", axis=1)

# Making a Model for Copies1stWeek

First, we build a model to predict the copies an otome game sells during its first week after release. Since Copies1stWeek is empty for some otome games, we drop the rows where Copies1stWeek is NaN.
We then assign NoLI, NoFemale, NoFemaleLI, NoFemaleFI, and NoLGBT as independent variables (x) and Copies1stWeek as the dependent variable (y).

In [5]:
# Split data into features and label 
firstweek_df.dropna(subset=['Copies1stWeek'], inplace=True)
x = firstweek_df[['NoLI', 'NoFemale', 'NoFemaleLI', 'NoFemaleFI', 'NoLGBT']].copy()
y = firstweek_df["Copies1stWeek"].copy() 

## Splitting the data into training set and test set
Below, we split the data into a training set and a test set. The training set is 70% of the data, and the test set is the remaining 30%.

In [6]:
from sklearn.model_selection import train_test_split

# Split data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=.7,
                                                           random_state=25)
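
To make the effect of `train_size` and `random_state` concrete, here is a small sketch on ten made-up rows (not the otome data). The variable names are hypothetical and chosen so they do not clash with the ones in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, train_size=0.7, random_state=25
)
print(len(tx_train), len(tx_test))

# Reusing the same random_state reproduces the same shuffle and the same split.
_, _, _, ty_test_again = train_test_split(
    toy_x, toy_y, train_size=0.7, random_state=25
)
print(np.array_equal(ty_test, ty_test_again))
```

With a dataset this small, which rows land in the test set can change the scores a lot, so a fixed `random_state` makes the results repeatable but not necessarily representative.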

## Creating the model
We create the model in this next step! Specifically, we create a linear regression model, fit it to the training data, and then use it to predict Copies1stWeek from the x values in the test set.

In [7]:
from sklearn.linear_model import LinearRegression

# Instantiating the model
model = LinearRegression()

# Converting the pandas objects to NumPy arrays
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()
x_test = x_test.to_numpy()
y_test = y_test.to_numpy()

# Training the model
model.fit(x_train, y_train)

# Making predictions on the test set
lin_reg_preds = model.predict(x_test)

## Evaluating the model
We evaluate the accuracy of the model here by printing out the Mean Squared Error and the R-squared, which we get by comparing the actual values of Copies1stWeek in the test set with the model's predictions.

In [8]:
from sklearn.metrics import mean_squared_error, r2_score
# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, lin_reg_preds))
print("R-squared:", r2_score(y_test, lin_reg_preds))

Mean Squared Error: 42877454.16456728
R-squared: -0.30999625414575505


The Mean Squared Error is the average squared difference between the model's predictions and the actual values, which are in y_test in this case. We can see that the mean squared error is a very, VERY large number: its square root is about 6,548, so a typical prediction is off by thousands of copies. The R-squared typically ranges from 0 to 1, and a negative R-squared means the model fits the test data worse than simply predicting the average of y_test every time. This further confirms that our otome game model is very inaccurate.
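
To make the two metrics concrete, here is a small worked sketch with made-up numbers (not the otome data). It computes MSE and R-squared by hand with NumPy, using the standard definitions, and shows that always predicting the mean gives an R-squared of 0 while predictions worse than that go negative. The helper names `mse` and `r_squared` are hypothetical. The last line takes the square root of the MSE printed above.

```python
import numpy as np

def mse(actual, preds):
    # Average of the squared prediction errors.
    return float(np.mean((actual - preds) ** 2))

def r_squared(actual, preds):
    # 1 - (model's squared error) / (squared error of always predicting the mean).
    ss_res = np.sum((actual - preds) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up values for illustration only.
actual = np.array([1000.0, 2000.0, 3000.0, 4000.0])
mean_preds = np.full_like(actual, actual.mean())
bad_preds = np.array([4000.0, 3000.0, 2000.0, 1000.0])

print("R-squared, always predicting the mean:", r_squared(actual, mean_preds))
print("R-squared, predictions worse than the mean:", r_squared(actual, bad_preds))
print("MSE of the bad predictions:", mse(actual, bad_preds))

# RMSE is in the same units as the target (copies sold), so it is easier to read.
first_week_rmse = float(np.sqrt(42877454.16456728))
print("RMSE for the Copies1stWeek model:", first_week_rmse)
```

Because MSE is in squared units, its size depends on the scale of the sales figures. The square root (RMSE) and the R-squared are usually easier to judge.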

Nonetheless, let's go on!

## Using the Model to Optimize the Copies1stWeek
First, we look at the coefficients for each of the five x-variables and the y-intercept. Then we write the objective function: the model's predicted sales, negated, because `minimize` looks for a minimum and we want the maximum. We then define constraints based on the ranges we see in the dataset (for example, there is a minimum of 1 love interest and a maximum of 19 within the entire **OtomeGames.csv** file).
After that, I made some initial guesses for the optimal values. I guessed 7 love interests, 4 female characters, 1 female love interest, 1 female friendship interest, and 0 LGBT characters based on what I saw after scrolling through the data.
Then, we ran them all through the optimizer!

In [9]:
from scipy.optimize import minimize

# Model coefficients (for optimization)
print("Coefficients:", model.coef_)  # Effect of each feature on sales
print("Intercept:", model.intercept_)

# Objective function: negative sales (we minimize this to maximize sales)
def objective(features):
    # 'NoLI', 'NoFemale', 'NoFemaleLI', 'NoFemaleFI', 'NoLGBT'
    noli, nofemale, nofemaleli, nofemalefi, nolgbt = features

    return -(model.coef_[0] * noli
             + model.coef_[1] * nofemale
             + model.coef_[2] * nofemaleli
             + model.coef_[3] * nofemalefi
             + model.coef_[4] * nolgbt
             + model.intercept_)

# Constraints (e.g., x-value ranges)
constraints = (
    {'type': 'ineq', 'fun': lambda x: x[0] - 1},  # noli >= 1
    {'type': 'ineq', 'fun': lambda x: 19 - x[0]},  # noli <= 19
    {'type': 'ineq', 'fun': lambda x: x[1] - 1},  # nofemale >= 1
    {'type': 'ineq', 'fun': lambda x: 9 - x[1]},  # nofemale <= 9
    {'type': 'ineq', 'fun': lambda x: x[2] - 0},  # nofemaleli >= 0
    {'type': 'ineq', 'fun': lambda x: 2 - x[2]},  # nofemaleli <= 2
    {'type': 'ineq', 'fun': lambda x: x[3] - 0},  # nofemalefi >= 0
    {'type': 'ineq', 'fun': lambda x: 3 - x[3]},  # nofemalefi <= 3
    {'type': 'ineq', 'fun': lambda x: x[4] - 0},  # nolgbt >= 0
    {'type': 'ineq', 'fun': lambda x: 1 - x[4]}  # nolgbt <= 1
)

# Initial guess
initial_guess = [7, 4, 1, 1, 0]

# Optimize
result = minimize(objective, initial_guess, constraints=constraints)
optimal_noli, optimal_nofemale, optimal_nofemaleli, optimal_nofemalefi, optimal_nolgbt = result.x

Coefficients: [  1284.50790959   -483.92110737 -10001.80662107   6450.00063339
  -3740.63019054]
Intercept: 3584.659952351167


# Results: Optimized Values for Copies1stWeek
Drumroll!!!

In [10]:
print("Optimal Number of Love Interests:", optimal_noli)
print("Optimal Number of Female Characters:", optimal_nofemale)
print("Optimal Number of Female Love Interests:", optimal_nofemaleli)
print("Optimal Number of Female Friendship Interests:", optimal_nofemalefi)
print("Optimal Number of LGBT characters:", optimal_nolgbt)

Optimal Number of Love Interests: 19.0000299884407
Optimal Number of Female Characters: 0.9999886595493308
Optimal Number of Female Love Interests: -0.00023582421090395655
Optimal Number of Female Friendship Interests: 3.000152048887685
Optimal Number of LGBT characters: -8.820710081636207e-05


19 love interests! That's a lot!
1 female character? That's just the protagonist.
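0 female love interests? But 3 friendship interests, even though we only have one female character? That's contradictory, and it happens because the constraints limit each variable on its own and nothing ties the counts to each other. And 0 LGBT characters.
Keep in mind that this model is extremely inaccurate!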
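
The optimizer's answer sits at the edge of every allowed range, which is expected when the objective is linear: each feature is pushed to its maximum if its coefficient is positive and to its minimum if it is negative. The sketch below reproduces that reasoning without `scipy`, using the coefficients and intercept printed earlier and the ranges from the constraints. The variable names are hypothetical.

```python
import numpy as np

# Coefficients printed above, in the order
# NoLI, NoFemale, NoFemaleLI, NoFemaleFI, NoLGBT.
coef = np.array([1284.50790959, -483.92110737, -10001.80662107,
                 6450.00063339, -3740.63019054])
intercept = 3584.659952351167

# Ranges taken from the constraints in the optimization cell.
lower = np.array([1, 1, 0, 0, 0])
upper = np.array([19, 9, 2, 3, 1])

# Positive coefficient: take the upper bound. Otherwise: take the lower bound.
corner = np.where(coef > 0, upper, lower)
predicted_sales = float(coef @ corner + intercept)

print("Corner solution:", corner)
print("Predicted first-week copies:", predicted_sales)
```

The tiny deviations in the values printed by the optimizer (such as 19.00003 or -0.0002) come from solver tolerance. Rounded to whole characters, they match this corner.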

Time to repeat the process but with CopiesTotal, instead!

# Making a Model for CopiesTotal
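We do the exact same thing, but this time, we change the dependent variable, "y", to be **CopiesTotal** instead! The rest of the process is exactly the same as above. Because `objective` reads the global `model`, refitting `model` here is enough for the optimizer to use the new coefficients.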

In [11]:
totalsales_df.dropna(subset=['CopiesTotal'], inplace=True)
x = totalsales_df[['NoLI', 'NoFemale', 'NoFemaleLI', 'NoFemaleFI', 'NoLGBT']].copy()
y = totalsales_df["CopiesTotal"].copy() 

# Split data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=.7,
                                                           random_state=25)

# Instantiating the model
model = LinearRegression()

# Converting the pandas objects to NumPy arrays
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()
x_test = x_test.to_numpy()
y_test = y_test.to_numpy()

# Training the model
model.fit(x_train, y_train)

# Making predictions on the test set
lin_reg_preds = model.predict(x_test)

# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, lin_reg_preds))
print("R-squared:", r2_score(y_test, lin_reg_preds))

# Model coefficients (for optimization)
print("Coefficients:", model.coef_)  # Effect of each feature on sales
print("Intercept:", model.intercept_)

result = minimize(objective, initial_guess, constraints=constraints)
optimal_noli, optimal_nofemale, optimal_nofemaleli, optimal_nofemalefi, optimal_nolgbt = result.x

Mean Squared Error: 373731315.9722419
R-squared: -0.687066828403814
Coefficients: [ 2389.02653409  -225.94716741     0.         17060.6281709
     0.        ]
Intercept: 1549.6157938347715


It looks like this model is also very inaccurate compared to the actual data, similar to our model for Copies1stWeek. Its R-squared is negative as well.

# Results: Optimized Values for CopiesTotal
Drumroll #2!!

In [12]:
print("Optimal Number of Love Interests:", optimal_noli)
print("Optimal Number of Female Characters:", optimal_nofemale)
print("Optimal Number of Female Love Interests:", optimal_nofemaleli)
print("Optimal Number of Female Friendship Interests:", optimal_nofemalefi)
print("Optimal Number of LGBT characters:", optimal_nolgbt)

Optimal Number of Love Interests: 18.99995498937642
Optimal Number of Female Characters: 1.0000042261169426
Optimal Number of Female Love Interests: 1.0
Optimal Number of Female Friendship Interests: 2.9996766522308462
Optimal Number of LGBT characters: 0.0


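Overall, very similar to the previous model, except that we gained one female love interest! That difference is not a real finding, though. The coefficients for NoFemaleLI and NoLGBT are both 0 in this model, so the optimizer had no reason to move them and left them at the initial guesses of 1 and 0.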
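
A coefficient of 0 means the prediction does not depend on that feature at all. Here is a small check of that, using the coefficients and intercept printed for the CopiesTotal model. The helper `predict_total` is a hypothetical name, and the other four feature values are simply the rounded optimizer results.

```python
import numpy as np

# Coefficients and intercept printed for the CopiesTotal model, in the order
# NoLI, NoFemale, NoFemaleLI, NoFemaleFI, NoLGBT.
coef_total = np.array([2389.02653409, -225.94716741, 0.0, 17060.6281709, 0.0])
intercept_total = 1549.6157938347715

def predict_total(features):
    # Linear model: dot product of coefficients and features, plus the intercept.
    return float(coef_total @ np.asarray(features, dtype=float) + intercept_total)

# Vary only NoFemaleLI (index 2) across its allowed range of 0 to 2.
predictions = [predict_total([19, 1, n_female_li, 3, 0]) for n_female_li in (0, 1, 2)]
print(predictions)
```

All three predictions are identical, so under this model 0, 1, or 2 female love interests are equally "optimal", and the value the optimizer reports is just the initial guess carried through.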

### Relevant citations
Code was referenced from https://www.datacamp.com/tutorial/machine-learning-python