## Preparing the Dataset
The dataset we are using is **OtomeGames.csv**, which we load into Python with the two code cells below.

In [1]:
import csv
def prepare_datasets(file_path):
    """ 
    Accepts: path to a comma-separated (CSV) plaintext file
    Returns: a list containing a dictionary for every row in the file, 
        with the file column headers as keys
    """
    
    with open(file_path, newline='') as infile:
        reader = csv.DictReader(infile, delimiter=',')
        list_of_dicts = [dict(r) for r in reader]
        
    return list_of_dicts

In [2]:
otome_games = prepare_datasets("csvfiles/OtomeGames.csv")
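
To make the shape of the result concrete, here is a minimal sketch of what `csv.DictReader` produces. The sample rows and the `Title` column are made up for illustration and are not taken from OtomeGames.csv. The point is that every value comes back as a string, and a missing cell comes back as an empty string.

```python
import csv
import io

# Hypothetical two-row sample; the real OtomeGames.csv has its own columns and rows.
sample_csv = io.StringIO(
    "Title,CopiesTotal,NoLI\n"
    "Game A,1200,5\n"
    "Game B,,6\n"
)

# Same pattern as prepare_datasets: one dictionary per row, headers as keys.
rows = [dict(r) for r in csv.DictReader(sample_csv, delimiter=",")]

print(rows[0])  # every value is a string, including the numbers
print(rows[1])  # the missing CopiesTotal cell is an empty string
```

This is why the empty cells need special handling before any numeric analysis.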

## Converting to a DataFrame Object in Pandas
Here, we make the data easier to work with in Pandas, a Python library, by converting the dataset into a DataFrame object. We then select the columns of interest and turn all the empty cells into NaN values so that the correlation coefficient functions can handle them.

In [3]:
import pandas as pd
import numpy as np
games_df = pd.DataFrame(otome_games)

In [4]:
# Select the columns of interest, then replace empty strings with NaN.
# replace() without inplace=True returns a new DataFrame, which avoids the
# SettingWithCopyWarning raised when a slice of games_df is modified in place.
new_games_df = games_df[['Copies1stWeek', 'CopiesTotal', 'NoLI', 'NoFemale',
                         'NoFemaleLI', 'NoFemaleFI', 'NoLGBT']].replace('', np.nan)
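
Because `csv.DictReader` returns every cell as a string, the columns still hold text after the empty cells are replaced. One way to make the types explicit before computing correlations is to convert each column with `pd.to_numeric`. The sketch below uses a small made-up frame (`raw_df` and its values are hypothetical) rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of new_games_df (all values are strings).
raw_df = pd.DataFrame({
    "CopiesTotal": ["1200", "", "830"],
    "NoLI": ["5", "6", ""],
})

# Replace empty cells with NaN, then convert each column to a numeric dtype.
# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric_df = raw_df.replace("", np.nan).apply(pd.to_numeric, errors="coerce")

print(numeric_df.dtypes)
print(numeric_df)
```

With numeric columns, it is also easy to check how many values are missing per column using `numeric_df.isna().sum()`.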


## Calculating Correlation Coefficients

### Pearson Coefficient

In [5]:
print(new_games_df.corr(method="pearson"))

               Copies1stWeek  CopiesTotal      NoLI  NoFemale  NoFemaleLI  \
Copies1stWeek       1.000000     0.878076  0.398802 -0.093105   -0.093844   
CopiesTotal         0.878076     1.000000  0.306858  0.037081   -0.073801   
NoLI                0.398802     0.306858  1.000000 -0.042660    0.044341   
NoFemale           -0.093105     0.037081 -0.042660  1.000000    0.031413   
NoFemaleLI         -0.093844    -0.073801  0.044341  0.031413    1.000000   
NoFemaleFI          0.338679     0.479197  0.425383  0.091782    0.286812   
NoLGBT             -0.102215    -0.083519  0.048320 -0.030395    0.532482   

               NoFemaleFI    NoLGBT  
Copies1stWeek    0.338679 -0.102215  
CopiesTotal      0.479197 -0.083519  
NoLI             0.425383  0.048320  
NoFemale         0.091782 -0.030395  
NoFemaleLI       0.286812  0.532482  
NoFemaleFI       1.000000  0.088336  
NoLGBT           0.088336  1.000000  


### Spearman Coefficient

In [6]:
print(new_games_df.corr(method="spearman"))

               Copies1stWeek  CopiesTotal      NoLI  NoFemale  NoFemaleLI  \
Copies1stWeek       1.000000     0.901830  0.332617 -0.048499   -0.162087   
CopiesTotal         0.901830     1.000000  0.301250  0.046660   -0.162467   
NoLI                0.332617     0.301250  1.000000  0.029332   -0.013525   
NoFemale           -0.048499     0.046660  0.029332  1.000000    0.079686   
NoFemaleLI         -0.162087    -0.162467 -0.013525  0.079686    1.000000   
NoFemaleFI          0.071183     0.171073  0.353651  0.086650    0.227856   
NoLGBT             -0.122569    -0.121633  0.051968  0.008699    0.693876   

               NoFemaleFI    NoLGBT  
Copies1stWeek    0.071183 -0.122569  
CopiesTotal      0.171073 -0.121633  
NoLI             0.353651  0.051968  
NoFemale         0.086650  0.008699  
NoFemaleLI       0.227856  0.693876  
NoFemaleFI       1.000000  0.117120  
NoLGBT           0.117120  1.000000  


### Kendall Coefficient

In [7]:
print(new_games_df.corr(method="kendall"))

               Copies1stWeek  CopiesTotal      NoLI  NoFemale  NoFemaleLI  \
Copies1stWeek       1.000000     0.752371  0.251110 -0.033345   -0.132960   
CopiesTotal         0.752371     1.000000  0.227484  0.038271   -0.133302   
NoLI                0.251110     0.227484  1.000000  0.020708   -0.011978   
NoFemale           -0.033345     0.038271  0.020708  1.000000    0.070413   
NoFemaleLI         -0.132960    -0.133302 -0.011978  0.070413    1.000000   
NoFemaleFI          0.056895     0.135626  0.301946  0.079177    0.223138   
NoLGBT             -0.100775    -0.099374  0.045129  0.007188    0.690153   

               NoFemaleFI    NoLGBT  
Copies1stWeek    0.056895 -0.100775  
CopiesTotal      0.135626 -0.099374  
NoLI             0.301946  0.045129  
NoFemale         0.079177  0.007188  
NoFemaleLI       0.223138  0.690153  
NoFemaleFI       1.000000  0.114291  
NoLGBT           0.114291  1.000000  


## What coefficients should we keep note of?
For each type of coefficient (Pearson, Spearman, and Kendall), we look at the rows labeled **Copies1stWeek** and **CopiesTotal** and where they intersect with the relevant categories.
A positive coefficient means that the category tends to increase together with CopiesTotal/Copies1stWeek, while a negative coefficient means that one tends to decrease as the other increases. The closer the value is to 0, the weaker the correlation; the closer it is to 1 or -1, the stronger it is, with exactly 1 or -1 indicating a perfect positive or negative relationship. Note that a correlation describes how two columns move together and does not by itself show that one causes the other.
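
Instead of reading the sales rows off each full matrix by hand, the relevant cells can be collected into one table. The following is a minimal sketch on made-up numbers (the `toy` frame is hypothetical and not the real dataset). It keeps only the intersections between the category columns and the two sales columns for the Pearson and Spearman methods.

```python
import pandas as pd

# Hypothetical data using a few of the notebook's column names.
toy = pd.DataFrame({
    "Copies1stWeek": [100, 250, 300, 420, 500, 640],
    "CopiesTotal":   [400, 700, 650, 1200, 1500, 1400],
    "NoLI":          [3, 4, 5, 5, 6, 8],
    "NoLGBT":        [2, 0, 1, 0, 1, 0],
})

sales_cols = ["Copies1stWeek", "CopiesTotal"]
feature_cols = [c for c in toy.columns if c not in sales_cols]

# One block of columns per method; rows are categories, columns are sales figures.
summary = pd.concat(
    {m: toy.corr(method=m).loc[feature_cols, sales_cols]
     for m in ("pearson", "spearman")},
    axis=1,
)

print(summary.round(3))
```

The tuple of methods can be extended with `"kendall"` in the same way; to our knowledge, pandas needs SciPy installed for that method.
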
### Number of Love Interests: NoLI
Pearson:
- Copies1stWeek: 0.398802
- CopiesTotal: 0.306858

Spearman:
- Copies1stWeek: 0.332617
- CopiesTotal: 0.301250
  
Kendall:
- Copies1stWeek: 0.251110
- CopiesTotal: 0.227484

Overall, all three coefficients are positive for both sales columns. The Pearson and Spearman values are similar, while the Kendall values are noticeably smaller. This suggests that the number of love interests and the number of copies sold are moderately positively correlated.

### Number of Female Main Characters:
Pearson:
- Copies1stWeek: -0.093105
- CopiesTotal: 0.037081

Spearman:
- Copies1stWeek: -0.048499
- CopiesTotal: 0.046660
  
Kendall:
- Copies1stWeek: -0.033345
- CopiesTotal: 0.038271

The values in this category are all extremely close to 0, and the sign flips between Copies1stWeek (negative) and CopiesTotal (positive). Because of this, there is no evidence of a meaningful correlation between the number of female main characters and the sales of otome games.

### Number of Female Love Interests:
Pearson:
- Copies1stWeek: -0.093844
- CopiesTotal: -0.073801

Spearman:
- Copies1stWeek: -0.162087
- CopiesTotal: -0.162467
  
Kendall:
- Copies1stWeek: -0.132960
- CopiesTotal: -0.133302

Although the values are close to 0, all of them are negative across the three coefficients and both sales columns. This consistency hints at a weak negative correlation between the number of female love interests in an otome game and its sales, although values this small could also arise by chance.
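
Whether a coefficient this close to 0 is distinguishable from chance is not something the correlation matrix alone can tell us. One possible check, which is not part of the original analysis, is a permutation test: shuffle one column many times and see how often a correlation at least as strong appears by accident. The sketch below uses only the standard library; the data, `pearson_r`, and `permutation_p_value` are all hypothetical names introduced for illustration.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)

# Made-up values standing in for NoFemaleLI and CopiesTotal.
female_li = [0, 0, 1, 0, 2, 0, 1, 3, 0, 1]
copies = [900, 1200, 700, 1500, 650, 1100, 800, 600, 1000, 950]

r = pearson_r(female_li, copies)
p_value = permutation_p_value(female_li, copies)
print(f"r = {r:.3f}, permutation p-value = {p_value:.3f}")
```

A small p-value would mean that a correlation of this strength rarely appears when the pairing between the two columns is random.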

### Number of Female Friendship Interests:
Pearson:
- Copies1stWeek: 0.338679
- CopiesTotal: 0.479197

Spearman:
- Copies1stWeek: 0.071183
- CopiesTotal: 0.171073
  
Kendall:
- Copies1stWeek: 0.056895
- CopiesTotal: 0.135626

This one is interesting! The Pearson coefficient is much higher than the Spearman and Kendall coefficients. Going by the Pearson coefficient alone, we could say that there is a moderate positive correlation between the number of female friendship interests and the number of sales. However, the Spearman and Kendall values, although positive, are still close to 0. Pearson is sensitive to outliers, whereas the rank-based Spearman and Kendall coefficients are much less so, so one possible explanation is that a few high-selling games with many female friendship interests are driving the Pearson value. With all of those factors in mind, it is most likely that the number of female friendship interests has only a weak positive correlation with sales.
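
A gap like this between Pearson and the rank-based coefficients can be reproduced with a single extreme row. The sketch below uses made-up numbers (not the real dataset) in which nine games show no clear pattern and one game has both many female friendship interests and very high sales.

```python
import pandas as pd

# Hypothetical data: nine ordinary games plus one outlier in the last row.
toy = pd.DataFrame({
    "NoFemaleFI":  [0, 1, 0, 1, 0, 1, 0, 1, 0, 10],
    "CopiesTotal": [5, 3, 4, 6, 2, 1, 7, 8, 9, 100],
})

pearson = toy.corr(method="pearson").loc["NoFemaleFI", "CopiesTotal"]
spearman = toy.corr(method="spearman").loc["NoFemaleFI", "CopiesTotal"]

# Dropping the outlier shows how much of the Pearson value it accounted for.
pearson_without_outlier = (
    toy.iloc[:-1].corr(method="pearson").loc["NoFemaleFI", "CopiesTotal"]
)

print(f"Pearson:                   {pearson:.3f}")
print(f"Spearman:                  {spearman:.3f}")
print(f"Pearson (outlier removed): {pearson_without_outlier:.3f}")
```

In this toy case the Pearson value is close to 1 with the outlier and near 0 without it, while Spearman stays low throughout. Whether something similar is happening in the otome game data would need to be checked by inspecting the rows with the highest NoFemaleFI values.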

### Number of LGBT characters
Pearson:
- Copies1stWeek: -0.102215
- CopiesTotal: -0.083519

Spearman:
- Copies1stWeek: -0.122569
- CopiesTotal: -0.121633
  
Kendall:
- Copies1stWeek: -0.100775
- CopiesTotal: -0.099374

Similar to the number of female love interests, the coefficients are all very close to 0 but are all negative. This hints at a weak negative correlation between the number of LGBT characters in an otome game and its sales.

## What's Next?
Next, we move on to **machinelearning.ipynb**, where we try to optimize the sales of otome games!