# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names

- Merel van den Bos
- Alex Rivera
- Albert Aung
- Lillian Wood

# Abstract 

Individuals leave reviews on the popular online gaming storefront Steam every day. People often go into great detail reviewing their favorite or least favorite games, making this a ripe field for sentiment analysis. Our goal is to train a model to detect the sentiment of a Steam user's review and be able to relay if this person enjoyed the game or not, and to what extent. We will be training a model with a large labelled dataset of reviews. Success will be measured by how accurately the model is able to detect whether a review is positive or negative, and if the person recommended the game or not.

# Background
 
Steam is a widely used online marketplace for PC video games. According to Statista, approximately 120 million people were active monthly on Steam in 2020, which demonstrates its wide reach <a name="statista"></a>[<sup>[2]</sup>](#statistanote). The platform has also developed a social dimension, housing the PC gaming community and connecting friends through mutually played games. An important feature of Steam, both as a marketplace and as a social sphere, is the ability to write, read, and rate reviews for games.

The plethora of reviews on Steam provides an interesting and abundant source of text with the potential to drive useful sentiment analysis models. In their paper, "Steam Review Dataset - new, large scale sentiment dataset," Sobkowicz and Stokowiec introduce a dataset which they claim could be a powerful source of consumer data for sentiment analysis <a name="sobkowicz"></a>[<sup>[1]</sup>](#sobkowicznote). Utilizing these reviews has relevant and important implications. Since the reviews act as consumer data, conclusions drawn from them inform both gamers and game developers about successful and worthwhile games. Additionally, reviewing is a very powerful feature, providing gamers with a voice that directly impacts which games sell and which games flop [<sup>[1]</sup>](#sobkowicznote). Tailoring the review system to be effective and informative would be highly valuable both for the development of future games and for the satisfaction of gamers.

As it stands, reviews consist of both written commentary and a "Yes" (thumbs-up) or "No" (thumbs-down). The Yes and No ratings for each game are averaged to create two summative ratings which appear underneath the synopsis of the game when a user views the game's page. "Recent Reviews" averages the number of positive and negative recent reviews. "All" averages the number of all positive and negative reviews. The summative ratings are labeled on a scale of: overwhelmingly negative, very negative, negative, mostly negative, mixed, mostly positive, positive, very positive, overwhelmingly positive. 

Although this feedback is already very helpful to gamers and game developers, it is questionable whether it is entirely accurate. As they stand, the summative ratings are built from binary input, based only on "Yes" or "No." There is no way to account for partially liking or disliking a game. Forcing users to choose between Yes and No may skew the ratings, since there is no middle ground. Therefore, it would be useful to also analyze the written reviews in aggregate in order to develop a more well-rounded summary of a game's reviews.

We propose that the most meaningful way to analyze and summarize a game's written reviews is to perform sentiment analysis and use it to predict the rating of the game.

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).
 - We are going to predict whether a Steam review is positive or negative from its text, and use these per-review predictions to predict the rating of a game.

# Data

- Dataset: https://www.kaggle.com/datasets/andrewmvd/steam-reviews
- Description: The dataset contains over 6.4 million observations, which are publicly available English-language reviews from the Steam Reviews portion of the Steam store run by Valve. Five variables describe each observation: Game id, Game Name, Review text, Review Sentiment (whether the review recommends the game or not), and Review vote (whether the review was recommended by another user or not).
- The critical variables are Review text, Review sentiment, and Review vote. Review text is string data. Review sentiment is coded -1 for a negative review and 1 for a positive review. Review vote is coded 0 for not recommended and 1 for recommended.
- Review sentiment and Review vote are already numerical, which reduces the amount of cleaning required.

# Proposed Solution

Our solution is sentiment analysis: studying a text in order to classify it, which in our case is a binary classification (i.e. whether a review is positive or negative). With the sentiment predicted for each review, we can then make predictions about the rating of each game. We will apply different models to see which one best predicts game ratings. Before modeling, we will pre-process the text to reduce noise and dimensionality, which improves the efficiency of the machine learning models. The cleaning steps we plan to use are converting all words to lowercase, removing numbers, removing stopwords, and removing punctuation.
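The cleaning steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the final pipeline: the function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for the example, and a real run would use a fuller stopword list.

```python
import re
import string
from typing import List

# Small illustrative stopword list (an assumption for this sketch);
# a real run would use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(review: str) -> List[str]:
    """Lowercase a review, remove numbers and punctuation, and drop stopwords."""
    text = review.lower()                   # 1. lowercase
    text = re.sub(r"\d+", " ", text)        # 2. remove numbers
    text = text.translate(PUNCT_TO_SPACE)   # 3. remove punctuation
    # 4. split on whitespace and remove stopwords
    return [token for token in text.split() if token not in STOPWORDS]


tokens = preprocess("This game is GREAT!!! Played it for 100 hours.")
print(tokens)
```

Replacing punctuation with spaces rather than deleting it keeps words apart when a review omits the space after a comma or period.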

One way to perform sentiment analysis is to use a Support Vector Machine (SVM) to separate reviews that appear positive from reviews that appear negative. The trained model will then allow us to categorize reviews automatically, without further user input. An SVM requires choosing a kernel. We will most likely choose a linear kernel, since the task is a binary decision between positive and negative reviews and linear kernels tend to work well on high-dimensional, sparse text features. This can be done using the scikit-learn library.
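As a rough sketch of what this could look like in scikit-learn, the example below chains a TF-IDF vectorizer with a linear SVM. The choice of TF-IDF features, the value `C=1.0`, and the six toy reviews are all assumptions made for illustration; none of them come from the dataset or from a decision the team has made yet.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy reviews invented for illustration; labels follow the dataset's
# coding for Review sentiment (1 = positive, -1 = negative).
train_texts = [
    "great game loved every minute",
    "amazing story and fun gameplay",
    "fun and addictive, highly recommend",
    "terrible controls and boring levels",
    "boring, buggy, waste of money",
    "awful port, constant crashes",
]
train_labels = [1, 1, 1, -1, -1, -1]

# TF-IDF turns each review into a sparse numeric vector;
# LinearSVC fits a linear-kernel SVM on those vectors.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LinearSVC(C=1.0),
)
model.fit(train_texts, train_labels)

# Classify two unseen toy reviews.
predictions = model.predict(["fun game, loved it", "boring and buggy"])
print(predictions)
```

On the full dataset, the regularization strength `C` would be one of the settings tuned during model selection.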

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

Since the problem we are tackling is a classification problem (i.e. whether a review is positive or negative), our evaluation metrics will be accuracy, precision, recall, and F1 score.

Classification accuracy is the fraction of predictions that are correct (Accuracy = Number of Correct Predictions / Total Number of Predictions Made). Precision is the number of correct positive results divided by the number of positive results predicted by the model (Precision = Number of True Positives / (Number of True Positives + Number of False Positives)). Recall is the number of correct positive results divided by the number of all samples that should have been identified as positive (Recall = Number of True Positives / (Number of True Positives + Number of False Negatives)).

High precision with low recall means the model's positive predictions are mostly correct but it misses a significant number of positive instances; low precision with high recall means the model misses few positive instances but many of its positive predictions are wrong. The F1 score combines both as the harmonic mean of precision and recall (F1 = 2 * 1/(1/Precision + 1/Recall)). It ranges from 0 to 1, and a high value requires both precision and recall to be high.

Source: https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
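To make the formulas concrete, here is a small standard-library sketch that computes all four metrics from true and predicted labels. The function name `classification_metrics`, the example labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration. The F1 expression used is algebraically the same as the harmonic-mean form given above.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Assumed convention: a metric is 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels using the dataset's coding (1 = positive, -1 = negative).
y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, 1, -1, 1, -1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example there are 3 true positives, 1 false positive, and 1 false negative, with 6 of 8 predictions correct, so all four metrics come out to 0.75.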

# Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

# Team Expectations 

Expectations:

Everyone is to be treated with respect.

Everyone is expected to have an open line of communication so that everyone's on the same page of where we are in the project and how we are looking to accomplish certain tasks for each week.

Everyone is allowed to work in their own way (whether working individually, joining group calls, or meeting up online/in person) as long as they meet the deadline that is set. We are aware that everyone has their own schedule, is at a different stage in life, and has other commitments, so we will be flexible about how we work as long as each task is done by the agreed timeline, which is usually before the day/time of the project deadline.

When someone has a hard time meeting a deadline, we will communicate about it early and upfront. This makes the other members aware of the team member's situation so that they can help out.


What we have done so far:

We communicate through Discord, where we message each other and/or make calls, whichever the discussion requires. We will also try to meet in person on campus once in a while to work on the project together.

We all respect each other's way of working, and we reach decisions by consensus after discussing them together. We also try to take everyone's preferences into account, for instance when choosing topics, and so far this has worked smoothly for us.

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/18  |  7 PM |  Brainstorm topics/questions (all)  | Discuss and decide topic for the project proposal; Let everyone know when you can work on it; Set deadlines | 
| 4/24 (online chat)  |  10 PM |  Put in the file what was discussed | Help each other out where help is needed (all); Split up the work that is left | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (Beckenbaur)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data, do some EDA (Maradonna) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Cruyff) | Discuss/edit project code; Complete project |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="sobkowicznote"></a>1.[^](#sobkowicz): Sobkowicz, Antoni & Stokowiec, Wojciech. (2016). Steam Review Dataset - new, large scale sentiment dataset. https://www.researchgate.net/publication/311677831_Steam_Review_Dataset_-_new_large_scale_sentiment_dataset<br> 
<a name="statistanote"></a>2.[^](#statista): Statista.com. (2021). Number of peak concurrent Steam users from January 2013 to September 2021. https://www.statista.com/statistics/308330/number-stream-users/<br>
<a name="sotanote"></a>3.[^](#sota): Steam.com. (2022). https://store.steampowered.com/about/<br> 