# COGS 118B - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or a common student training dataset such as mtcars, iris, or Palmer Penguins.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training an unsupervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project must include some elements of unsupervised learning, but you are welcome to include some supervised or other learning approaches as well.
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments
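
The size guidelines above can be checked mechanically before committing to a dataset. The sketch below is only an illustration: the helper name `check_dataset_size`, its parameters, and the example counts are assumptions, not course-provided code. The two counts would typically come from `df.shape`.

```python
def check_dataset_size(n_observations, n_variables,
                       min_observations=1_000, min_variables=5,
                       target_ratio=100):
    """Hypothetical helper: report whether a dataset meets the size guidelines."""
    return {
        # hard constraints: >1k observations and >5 variables
        "enough_observations": n_observations > min_observations,
        "enough_variables": n_variables > min_variables,
        # rule of thumb: more than 100x as many observations as variables
        "ratio_ok": n_observations > target_ratio * n_variables,
    }

# Example with made-up counts: 200,000 rows and 12 columns
report = check_dataset_size(200_000, 12)
print(report)
```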


Feel free to delete this description section when you hand in your proposal.

# Names

Hopefully your team is at least this good. Obviously you should replace these with your names.

- Pelé
- Diego Maradona
- Johan Cruyff
- Roberto Carlos
- Franz Beckenbauer

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

# Background

Fill in the background and discuss the kind of prior work that has gone on in this research area here. **Use inline citation** to specify which references support which statements.  You can do that through HTML footnotes (demonstrated here). I used to recommend Markdown footnotes (Google is your friend) because they are simpler, but recently I have had some problems getting them to work, whereas HTML ones have always worked so far. So use the method that works for you, but do use inline citations.

Here is an example of inline citation. After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). Use a minimum of 2 or 3 citations, but we prefer more <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). You need enough citations to fully explain and back up important facts. 

Remember you are trying to explain why someone would want to answer your question or why your hypothesis is in the form that you've stated. 

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

You should have a strong idea of what dataset(s) will be used to accomplish this project. 

If you know (some of) the data you will use, please give the following information for each dataset:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc. that will be needed

If you don't yet know what your dataset(s) will be, you should describe what you desire in terms of the above bullets.

In [3]:
# imports
import pandas as pd
import numpy as np
import json
import gzip 
import matplotlib.pyplot as plt
import matplotlib.cm as cm

In [6]:
# load the JSON Lines file; despite its .gz name the file is plain text
# (gzip.open failed with "BadGzipFile: Not a gzipped file"), so use open()
with open('goodreads_book_genres_initial.json.gz', 'rt', encoding='utf-8') as f:
    df = pd.read_json(f, lines=True)

In [None]:
# load the first 200,000 rows of the interactions CSV
user_df = pd.read_csv('goodreads_interactions.csv', nrows=200000)
user_df

In [None]:
# randomly sample 800,000 books (fixed seed for reproducibility)
book_df = df.sample(800000, random_state=1)
book_df

In [None]:
# convert each genres dict to a list of genre names, then drop books with no genres
book_df['genres_lst'] = book_df['genres'].apply(lambda x: list(x.keys()))
book_df = book_df[book_df['genres_lst'].str.len() != 0]
book_df

In [None]:
# merge user interactions with book genres; dropna() removes interactions without genre data
merged_df = user_df.merge(book_df, how='left', on='book_id').dropna()
merged_df['user_id'].value_counts()

In [None]:
# check whether any NaN values are left (True for a column means it still has NaN)
merged_df.isna().any()

In [None]:
# import ML packages
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score

### K-means cluster

In [None]:
# Multi-hot encode the genres_lst column (10 unique genres in total)
mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(merged_df['genres_lst'])
genres_encoded

In [None]:
# create df with encoded genres 
genres_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)
genres_df['ident'] = np.arange(len(genres_df))  # positional row id used for the merge below
genres_df.head()

In [None]:
# create df with only the useful features (copy to avoid SettingWithCopyWarning)
only_users = merged_df[['user_id', 'is_read', 'rating']].copy()
only_users['ident'] = np.arange(len(only_users))  # positional row id matching genres_df
only_users.head()

In [None]:
# contains user and books info 
user_genres = only_users.merge(genres_df, how = 'right', left_on = 'ident', right_on = 'ident')
user_genres = user_genres.drop(['ident'], axis = 1)
user_genres

In [None]:
# one row per user: mean rating, number of books read, and per-genre counts
# (the genre counts are normalized to proportions below)
user_df = user_genres.groupby('user_id').agg({'rating': 'mean',
                                              'is_read': 'sum',
                                              **{genre: 'sum' for genre in mlb.classes_}})

row_sums = user_df[mlb.classes_].sum(axis=1)
user_df[mlb.classes_] = user_df[mlb.classes_].div(row_sums, axis=0)
user_df = user_df.drop(['is_read'], axis = 1)
user_df

In [None]:
# create and fit kmeans 
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(user_df)

In [None]:
# add label in clusters
user_df['cluster'] = kmeans.labels_

In [None]:
user_df

In [None]:
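# evaluation metric: mean silhouette score (exclude the 'cluster' label column from the features)
silhouette_avg = silhouette_score(user_df.drop(columns='cluster'), kmeans.labels_)
silhouette_avg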

In [None]:
# graph silhouette (exclude the 'cluster' label column from the features)
features = user_df.drop(columns='cluster')
silhouette_avg = silhouette_score(features, kmeans.labels_)
sample_silhouette_values = silhouette_samples(features, kmeans.labels_)

# Create a single subplot
fig, ax1 = plt.subplots(1, 1)
fig.set_size_inches(18, 7)

cluster_n = 5
ax1.set_xlim([-0.1, 1])
# The (5+1)*10 is for inserting blank space between silhouette
ax1.set_ylim([0, len(user_df) + (cluster_n + 1) * 10])

y_lower = 10

#5 clusters
for i in range(cluster_n):
    ith_cluster_silhouette_values = sample_silhouette_values[kmeans.labels_ == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / cluster_n)
    ax1.fill_betweenx(np.arange(y_lower, y_upper),
                      0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  


ax1.set_xlabel("silhouette coefficient values")
ax1.set_ylabel("Cluster")

# Vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

# Clear the yaxis labels
ax1.set_yticks([])  
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

# Display the silhouette plot
plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
              f'with n_clusters = {cluster_n}'),
             fontsize=14, fontweight='bold')

plt.show()
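
The silhouette computation above fixes `n_clusters = 5`. A minimal sketch of how the same score could be used to compare several values of k follows. The helper name `sweep_k` and the synthetic data are assumptions for illustration only; in the project the feature matrix would be the per-user table without the `cluster` column.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_k(features, k_values, random_state=42):
    """Hypothetical helper: fit KMeans for each k and return {k: mean silhouette}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores

# Synthetic stand-in for the per-user feature matrix: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

scores = sweep_k(X, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
print(scores, best_k)
```

A higher mean silhouette indicates tighter, better-separated clusters, so the k with the highest score is a reasonable starting candidate to inspect further.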

# Proposed Solution

We propose to develop an unsupervised machine learning model to recommend books to users based on their interactions and preferences. We will harness two datasets: `goodreads_book_genres_initial.json`, which contains information on books and their genres, and `goodreads_interactions.csv`, which details user interactions with books, including readings, reviews, and ratings. The proposed model will follow these steps:

1. **Data Preparation**: We will use the top 20,000 rows from `goodreads_interactions.csv`, as these users have read the most books and thus provide more information for training our model. We will also randomly sample 80,000 books from `goodreads_book_genres_initial.json` to ensure a sufficiently large sample. The `rating` and `is_read` fields from `goodreads_interactions.csv` will be set aside rather than used as clustering features, since we need `is_read` for our book recommendation algorithm later and `rating` to test how effective our model is at the end. The two datasets will then be merged to link users' interactions with specific book genres, giving a comprehensive view of user preferences.

2. **Feature Engineering**: We will focus on the genres of books a user has read to summarize their book preferences. We believe the genres feature can capture a user's preference for books, since users tend to prefer certain genres depending on their personalities or experiences and tend to keep these preferences over the long term. The aggregated data will be normalized to ensure uniformity and facilitate meaningful comparison between users.

3. **Clustering with K-Means**: K-Means clustering will be implemented to group users based on their normalized interaction profiles. The initial number of clusters (k) will be chosen based on domain knowledge and refined using silhouette analysis to ensure optimal cluster quality.

4. **Book Recommendations**: Books will be recommended to users based on cluster membership. A book read by one user in a cluster can be recommended to another user in the same cluster who has not yet interacted with that book.

5. **Implementation Details**: We plan to use Python for implementation, utilizing libraries such as Pandas for data manipulation, Scikit-learn for K-Means clustering and silhouette analysis, and NumPy for numerical operations. Code reproducibility will be ensured by using our class conda environment and providing step-by-step execution instructions.
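
Step 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: the helper `recommend_from_cluster`, the ranking rule (number of cluster peers who read a book, ties broken by `book_id`), and the toy data are all hypothetical.

```python
import pandas as pd

def recommend_from_cluster(interactions, user_clusters, user_id, top_n=5):
    """Hypothetical helper: books read by cluster peers that user_id has not touched."""
    cluster = user_clusters.loc[user_id]
    same_cluster = user_clusters.index[user_clusters == cluster]
    peers = same_cluster[same_cluster != user_id]

    # every book the target user has already interacted with (read or not)
    seen = set(interactions.loc[interactions["user_id"] == user_id, "book_id"])

    # books actually read by peers, excluding the ones the user has seen
    peer_reads = interactions[interactions["user_id"].isin(peers)
                              & (interactions["is_read"] == 1)]
    candidates = peer_reads[~peer_reads["book_id"].isin(seen)]

    # rank by number of distinct peers who read the book; break ties by book_id
    counts = (candidates.groupby("book_id")["user_id"].nunique()
                        .reset_index(name="n_readers"))
    ranked = counts.sort_values(["n_readers", "book_id"], ascending=[False, True])
    return ranked["book_id"].head(top_n).tolist()

# Toy data (made up for illustration)
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3, 4],
    "book_id": [10, 11, 10, 12, 13, 12, 14, 20],
    "is_read": [1, 1, 1, 1, 1, 1, 0, 1],
})
user_clusters = pd.Series({1: 0, 2: 0, 3: 0, 4: 1})

recs = recommend_from_cluster(interactions, user_clusters, user_id=1)
print(recs)
```

In the toy data, user 1 shares a cluster with users 2 and 3, so only books those two have read and user 1 has not yet interacted with are candidates.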

### Testing the Solution
To test our solution, we will focus on the ratings provided by users to assess the effectiveness of our recommendations. Specifically, we will evaluate whether users within the same cluster tend to give higher ratings to books that are commonly read and appreciated within their cluster. This approach presupposes that effective recommendations should align with users' existing preferences, as indicated by their ratings. The testing strategy will involve:

1. **Rating Analysis within Clusters**: We will calculate the average rating for books within each cluster that are recommended based on shared interests. This will involve comparing the average ratings of recommended books to a baseline to assess the quality of recommendations.

2. **Cross-Validation**: A form of cross-validation tailored for unsupervised learning will be implemented, where a portion of user interactions is held out from the clustering process. We will predict their ratings for unseen books based on cluster averages. These predicted ratings will then be compared to the actual ratings to measure accuracy.


3. **Benchmarking**: As a benchmark, we will use a simple popularity-based recommendation system, where the most popular books across the entire dataset are recommended to every user. The performance of our unsupervised model will be compared against this benchmark to demonstrate its effectiveness in providing personalized recommendations.
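
The hold-out prediction and the popularity benchmark described above could look roughly like the sketch below. It uses a single train/test split rather than full cross-validation, and the function names, the fallback to the global mean rating, the choice of RMSE as the accuracy measure, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def predict_cluster_rating(train, test, user_clusters):
    """Predict each held-out rating as the mean training rating of the same book
    within the user's cluster, falling back to the global training mean."""
    train = train.assign(cluster=train["user_id"].map(user_clusters))
    test = test.assign(cluster=test["user_id"].map(user_clusters))
    cluster_book_mean = (train.groupby(["cluster", "book_id"])["rating"]
                              .mean().rename("pred").reset_index())
    out = test.merge(cluster_book_mean, on=["cluster", "book_id"], how="left")
    out["pred"] = out["pred"].fillna(train["rating"].mean())
    return out

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def most_popular(train, top_n=3):
    """Popularity benchmark: books with the most interactions overall."""
    counts = train.groupby("book_id").size().reset_index(name="n")
    ranked = counts.sort_values(["n", "book_id"], ascending=[False, True])
    return ranked["book_id"].head(top_n).tolist()

# Toy data (made up for illustration)
train = pd.DataFrame({"user_id": [1, 2, 3, 4, 1],
                      "book_id": [10, 10, 20, 20, 30],
                      "rating":  [5, 4, 2, 1, 3]})
test = pd.DataFrame({"user_id": [5, 6, 5],
                     "book_id": [10, 20, 40],
                     "rating":  [5, 2, 3]})
user_clusters = pd.Series({1: 0, 2: 0, 5: 0, 3: 1, 4: 1, 6: 1})

out = predict_cluster_rating(train, test, user_clusters)
model_rmse = rmse(out["rating"], out["pred"])
baseline_rmse = rmse(test["rating"], np.full(len(test), train["rating"].mean()))
print(model_rmse, baseline_rmse, most_popular(train, top_n=2))
```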


# Evaluation Metrics

The evaluation metric that we propose to quantify the performance of both the benchmark and solution models is the **Cluster-Based Rating Consistency (CBRC)**. This metric assesses the homogeneity of ratings within clusters and the alignment of recommendations with users' preferences.

**Mathematical Representation**:
$$
\text{CBRC} = \frac{1}{N_c} \sum_{i=1}^{N_c} \left( \frac{\sum_{j \in C_i} (r_j - \bar{r}_{C_i})^2}{|C_i|} \right)
$$
where:
- $N_c$ is the number of clusters,
- $C_i$ is the set of books recommended to the $i$th cluster,
- $r_j$ is the rating given to the $j$th book within cluster $C_i$,
- $\bar{r}_{C_i}$ is the average rating of books within cluster $C_i$,
- $|C_i|$ is the number of books in cluster $C_i$.

The CBRC metric will allow us to evaluate the consistency of recommendations across the user clusters, with lower values indicating higher consistency and alignment with user preferences. By comparing the CBRC scores of the benchmark model and our solution model, we can quantitatively determine which model provides more relevant and personalized recommendations.
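
As a concrete reading of the formula, the sketch below computes CBRC as the mean over clusters of the population variance of the ratings in each cluster. The function name `cbrc`, the dict-based input format, and the example ratings are assumptions for illustration; every cluster is assumed to contain at least one rating.

```python
import numpy as np

def cbrc(ratings_by_cluster):
    """Mean over clusters of the within-cluster population variance of ratings.

    ratings_by_cluster: dict mapping a cluster label to the ratings r_j of the
    books recommended to that cluster (the set C_i in the formula).
    """
    variances = []
    for ratings in ratings_by_cluster.values():
        r = np.asarray(list(ratings), dtype=float)
        variances.append(np.mean((r - r.mean()) ** 2))  # sum of squares / |C_i|
    return float(np.mean(variances))  # average over the N_c clusters

# Made-up ratings: cluster 0 is perfectly consistent, cluster 1 is spread out
example = {0: [4, 4, 4], 1: [1, 3, 5]}
score = cbrc(example)
print(score)
```

Here cluster 0 contributes a variance of 0 and cluster 1 a variance of 8/3, so the score is their average, 4/3.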

# Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy, discuss them here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination. Get creative!

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic (Pelé) | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (Beckenbauer)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data, do some EDA (Maradona) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Cruyff) | Discuss/edit project code; Complete project |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
