**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Benjamin Xia
- Kailey Wong
- Jesus Tello
- Thor
- Wasp

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)

How much of an impact do cryptocurrency miners/prices actually have on GPU prices and stock for individual customers (e.g. gamers)? Are there any other factors that have had a larger impact (e.g. global chip shortages affecting GPUs and other kinds of chips)?

## Background and Prior Work


Graphics Processing Units (a.k.a. GPUs, graphics cards, video cards) allow computers to perform parallel computations on a scale that is impractical on most traditional processing units (central processing units, a.k.a. CPUs). GPUs have enabled faster data processing, rendering, training of machine learning algorithms, and more. They also matter for gaming: on a low-spec "potato" laptop, hardware limitations often prevent demanding triple-A titles from running smoothly. To cope, users must reduce graphics settings, sacrificing visual quality for better performance. Even then, some games remain unplayable, forcing players to choose less demanding titles or upgrade to a more powerful system. Some game developers even rely on players having powerful machines as an excuse not to properly optimize their games for lower-end machines.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) As computer graphics continue to progress and games become more graphically intensive, demand for consumer-grade GPUs such as NVIDIA's GeForce lineup has only grown.

When cryptocurrency became mainstream (around 2017), more people started mining cryptocurrencies with consumer-grade GPUs due to their unreasonable effectiveness compared to traditional processors and the lack of supply for ASIC (Application-Specific Integrated Circuit) units specialized for cryptocurrency mining.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Cryptocurrency mining, along with the onset of the pandemic, caused GPU supplies to dwindle, leading gamers to blame cryptocurrency miners for GPU shortages and for sky-high pricing by eBay scalpers. Many GPU models, such as the NVIDIA GTX 1000 series and RTX 3000 series, were unavailable at MSRP (Manufacturer's Suggested Retail Price) for months. Gamers' reaction to the GPU market is evident from the overall sentiment in online communities such as Reddit.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Vendors have attempted to let more individual customers get their hands on these precious GPUs by imposing purchase limits.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

The story of how GPU prices skyrocketed has since become a case study in how supply and demand can swiftly decimate a consumer market.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5)

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Crider, Michael (7 Sep 2023) Trouble Running Starfield? Todd Howard Says Upgrade Your PC. *PCWorld*. https://www.pcworld.com/article/2058969/trouble-running-starfield-todd-howard-says-upgrade-your-pc.html
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Iyer, S.G., Pawar, A. Dipakumar. (28 Feb 2019) GPU and CPU Accelerated Mining of Cryptocurrencies and their Financial Analysis. *IEEE*. https://ieeexplore.ieee.org/document/8653733
3. <a name="cite_note-3"></a> [^](#cite_ref-3) u/rcmaehl (18 Nov 2018) Pre-Crypto Prices When? *r/pcmasterrace*. https://www.reddit.com/r/pcmasterrace/comments/9ygmux/precrypto_prices_when/
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Dent, Steve (11 Feb 2022) Best Buy Limits Sales of NVIDIA RTX-Series GPUs to Totaltech Subscribers. *EnGadget*. https://www.engadget.com/best-buy-gpu-sales-totaltech-membership-paywall-092357559.html
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Lim, H.W., Wibowo, T. (12 Apr 2022) Cryptocurrency Mining Effects On Semiconductor Shortage on PC Owner Community. *CoMBInES - Conference On Management, Business, Innovation, Education And Social Sciences*. https://journal.uib.ac.id/index.php/combines/article/view/6634

# Hypothesis


We suspect that cryptocurrency miners do have some impact on GPU prices and supply (in that there is some positive correlation between cryptocurrency prices and GPU prices), though their impact is often exaggerated by gamers in online communities, as external factors such as chip (semiconductor) shortages often have a greater impact on price and supply. We believe this to be the case because many vendors place limits on how many GPUs customers can purchase at a time, and many of the GPU shortages have coincided with global chip shortages (that were not isolated to GPUs).
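One way the "positive correlation" in this hypothesis could be checked is sketched here. This is only a minimal illustration under stated assumptions: the numbers are invented toy values rather than real prices, and the column names `crypto_price_usd` and `gpu_price_usd` are hypothetical placeholders for whatever monthly price series we end up using.

```python
import pandas as pd

# Toy monthly values, invented purely for illustration (NOT real prices).
monthly = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01"]
        ),
        "crypto_price_usd": [1000.0, 1500.0, 1800.0, 2100.0, 2700.0],
        "gpu_price_usd": [700.0, 820.0, 900.0, 950.0, 1100.0],
    }
).set_index("month")

# Pearson correlation between the two price levels.
pearson_r = monthly["crypto_price_usd"].corr(monthly["gpu_price_usd"])

# Correlation of month-over-month percentage changes, which is less
# sensitive to both series simply trending upward over time.
returns = monthly.pct_change().dropna()
returns_r = returns["crypto_price_usd"].corr(returns["gpu_price_usd"])

print(f"level correlation:  {pearson_r:.3f}")
print(f"change correlation: {returns_r:.3f}")
```

Comparing the level correlation with the change correlation may help separate a shared upward trend (which could also be driven by chip shortages) from month-to-month co-movement. Neither number on its own would establish that miners cause GPU price changes.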

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Steam Hardware Surveys
  - Link to the dataset: https://raw.githubusercontent.com/jdegene/steamHWsurvey/master/shs.csv
  - Number of observations: 34713
  - Number of variables: 5
- Dataset #2
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #3

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Steam Hardware Surveys

In [18]:
import pandas as pd

In [19]:
# Load the data
steam_survey = pd.read_csv('https://raw.githubusercontent.com/jdegene/steamHWsurvey/master/shs.csv')

# Check out the data
steam_survey.head()

Unnamed: 0,date,category,name,change,percentage
0,2008-11-01,AMD CPU Speeds,1.4 Ghz to 1.49 Ghz,-0.0004,0.0036
1,2008-11-01,AMD CPU Speeds,1.5 Ghz to 1.69 Ghz,-0.0025,0.0224
2,2008-11-01,AMD CPU Speeds,1.7 Ghz to 1.99 Ghz,-0.0024,0.0714
3,2008-11-01,AMD CPU Speeds,2.0 Ghz to 2.29 Ghz,-0.004,0.1343
4,2008-11-01,AMD CPU Speeds,2.3 Ghz to 2.69 Ghz,0.0001,0.0727


In [20]:
# Filter the survey responses by date
# We will only be looking at data from the last 8 years, i.e. since 2015
# Reset the index after filtering
steam_survey = steam_survey[pd.to_datetime(steam_survey['date']).dt.year >= 2015].reset_index(drop=True)
steam_survey.head()

Unnamed: 0,date,category,name,change,percentage
0,2015-01-01,Free Hard Drive Space,10 GB to 99 GB,-0.0029,0.2012
1,2015-01-01,Free Hard Drive Space,100 GB to 249 GB,-0.0006,0.2468
2,2015-01-01,Free Hard Drive Space,250 GB to 499 GB,-0.0008,0.2608
3,2015-01-01,Free Hard Drive Space,500 GB to 749 GB,0.0015,0.1232
4,2015-01-01,Free Hard Drive Space,750 GB to 999 GB,0.002,0.0879


In [21]:
# What is the shape of this data set?
steam_survey.shape

(34713, 5)
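A possible next wrangling step is to keep only the GPU-related rows and reshape them so that each GPU model becomes a column. The sketch below uses a small made-up stand-in with the same five columns as `shs.csv` so that it runs on its own. The category label `"Video Card Description"`, the model names, and the helper `gpu_share_by_month` are assumptions for illustration; the actual labels should be checked with `steam_survey['category'].unique()` first.

```python
import pandas as pd

# Small stand-in with the same columns as shs.csv; the rows are made up.
toy_survey = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01", "2021-02-01"],
        "category": ["Video Card Description"] * 4 + ["Free Hard Drive Space"],
        "name": ["GPU A", "GPU B", "GPU A", "GPU B", "10 GB to 99 GB"],
        "change": [0.001, -0.001, 0.002, -0.002, 0.0],
        "percentage": [0.10, 0.05, 0.102, 0.048, 0.20],
    }
)


def gpu_share_by_month(survey, gpu_category="Video Card Description"):
    """Return a date-by-model table of survey share for one category.

    The default category label is an assumption, not a confirmed value.
    """
    gpu_rows = survey[survey["category"] == gpu_category].copy()
    gpu_rows["date"] = pd.to_datetime(gpu_rows["date"])
    # One row per survey month, one column per GPU model.
    return gpu_rows.pivot_table(
        index="date", columns="name", values="percentage", aggfunc="first"
    )


gpu_share = gpu_share_by_month(toy_survey)
print(gpu_share)
```

On the real data, the same function could be called as `gpu_share_by_month(steam_survey)` once the category label has been confirmed. A wide table indexed by month should also be easier to join with monthly price data later.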

## Dataset #2 (if you have more than one, use name instead of number here)

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

Our proposed data will mainly come from Steam surveys, company sales, general finance data, and other easily accessible public information. Because the data from the Steam surveys is already anonymized, there should not be any ethical or privacy concerns about extracting personally identifiable information from the survey data. However, this also makes it difficult to account for potential biases in the dataset, as we can only guess at the demographics of the Steam users who answered the survey. Additionally, surveys of this type are subject to several kinds of bias, including non-response bias, self-reporting bias, and confounding variables, which could lead to a disproportionate amount of high-end hardware appearing in survey results.

One potential source of data we have discussed involves scraping data from websites, which may not be legal or allowed depending on the website. As we consider other potential data sources, being aware of terms of service and other restrictions will be an important factor in choosing our final datasets.

Regarding company sales and finance data, we don't anticipate any ethical or privacy-related concerns in our use of the data, since by nature it is anonymized. However, there is still some potential to identify individual companies or investors through the data. One way to address this is simply to aggregate the data, since we only require a broader overview. There may also be concerns about how this data was acquired and any biases that may have been introduced.

To help detect these biases, we will thoroughly explore the data before deeper analysis, attempting to fully understand it and noting factors that may indicate bias. In our post-analysis, we will discuss the results of our exploration. If we suspect any biases, we will explain their significance and their impact on the results of our analysis.

# Team Expectations 

* Attend weekly meetings on Sunday 9pm (or have a good excuse).
* On weeks where some project component is due on Wednesday, attempt to have a working draft (e.g. 90% complete) by Monday, have everyone look over everything and revise on Tuesday, and finalize on Wednesday. Depending on the scale of a project component this timeline may be pushed forward.
* If you ghost meetings without saying anything or fail to make meaningful attempts to contribute, your name will not be added to the notebook corresponding to whatever project component is due, and there will be a note that you did not contribute underneath everyone else's names :(
* Don't sabotage other people's stuff >:(
* Don't plagiarize :)

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/23  | 4:10 PM | Review previous data analysis projects and think about strengths and weaknesses of each project | Discuss previous projects to avoid making the same mistakes |
| 10/30  | 9:00 PM | Brainstorm topics/questions; Review the previous project to get some ideas | Determine best form of communication; Discuss and decide on data analysis topic; Discuss hypothesis; Begin background research |
| 11/05  | 9:00 PM | Finalize datasets we will use for our project | Discuss data wrangling and possible analytical approaches; Descriptive analysis for checkpoint 1 |
| 11/12  | 9:00 PM | Data cleaning and preprocessing; Explore the data | Draft data checkpoint; Assign group members to lead each specific part |
| 11/19  | 9:00 PM | Finalize and submit data checkpoint; Import & wrangle data; EDA | Review/edit wrangling/EDA; Discuss analysis plan; Draft EDA checkpoint |
| 11/26  | 9:00 PM | Finalize EDA checkpoint and submit; Begin analysis | Discuss/edit analysis |
| 12/03  | 9:00 PM | Complete analysis; Draft results/conclusion/discussion | Discuss/edit final project |
| 12/10  | 9:00 PM | Make demo video | Turn in final project & video |