# STAT 207 Group Lab Assignment 2 - [10 total points]

## Finding Missing Data and Answering Questions

<hr>

## <u>Purpose</u>:
You should work in groups of 2-3 on this report (working alone without permission will result in a point deduction). The purpose of this group lab assignment is to understand and wrangle data so that it is ready for future analysis.
<hr>

## <u>Assignment Instructions</u>:

### Contribution Report
These contribution reports should be included in all group lab assignments. In the contribution report below, you should list the following:
1. The netID for the lab submission to be graded.  (Some groups have each member create their own version of the document, but only one needs to be submitted for grading.  Other groups have only one member compose and submit the lab.)
2. Names and netIDs of each team member.
3. Contributions of each team member to report.
4. **For this assignment, add in whether each of you uses the gaming hub Steam.  If you do, include your favorite game through Steam.  If not, indicate whether you play any online games (and if so, which one).**

*For example:*

*<u>Teammates:</u>*

*doe105 should be graded.  John Smith (smith92) & Jane Doe (doe105) worked together on all parts of this lab assignment.  Both use Steam.  Favorite games are Jackbox Games for John and Grand Theft Auto for Jane.*

OR

*doe105 should be graded*

*1. John Smith (smith92): did numbers 1, 2, and 3; Jackbox Games through Steam*

*2. Jane Doe (doe105): did numbers 4 and 5; Grand Theft Auto through Steam*

junhao4 should be graded. kieranc3 also worked with us.


### Group Roles

You are expected to work in groups of 2-3 on this report.  Since you are working in groups, you may find it helpful to have specified roles.  These roles will likely be helpful in later labs.  Below, I provide roles that can be used for groups of 2 and for groups of 3.  I encourage you to switch roles within this lab report where possible.  I also encourage you to switch roles for each subsequent lab where possible, based on your group membership.

#### Groups of 2

* **Driver**: This student will type the report.  While typing the report, you may be the one who is selecting the functions to apply to the data.
* **Navigator**: This student will guide the process of answering the question.  Specific ways to help may include: outlining the general steps needed to solve a question (providing the overview), locating examples within the course notes, and reviewing each line of code as it is typed.

#### Groups of 3

* **Driver**: This student will type the report.  They may also be the one to select the functions to apply to the data.
* **Navigator**: This student will guide the process of answering the question.  They may select the general approach to answering the question and/or a few steps to be completed along the way. 
* **Communicator**: This student will review the report (as it is typed) to ensure that it is clear and concise.  This student may also locate relevant examples within the course notes that may help complete the assignment.

<hr>

### Imports

In [31]:
#Run this
import pandas as pd                    # imports pandas and calls the imported version 'pd'
import matplotlib.pyplot as plt        # imports the package and calls it 'plt'
import seaborn as sns                  # imports the seaborn package with the imported name 'sns'
sns.set()  

## Steam Data

Steam is the world's most popular PC Gaming hub. They have a massive catalog of games, with everything from AAA blockbusters to small indie titles.  

You are a team of data scientists working for Steam, and you are responsible for completing analysis for a report that will go to the executives of Valve (the parent company for Steam).  

Unfortunately, the data were collected with a less than optimal structure.  The dataset consists of transactions within the Steam platform for a random sample of 500 Steam users, along with their purchase and game play behaviors.  It has the following columns:
* user_id,
* game_name,
* activity:
    - purchase: indicating that the user has *purchased* the corresponding game
    - play: indicating that the user has *played* the corresponding game (for at least some amount of time).
* hours_played_if_play:
    - if the row corresponds to a 'play' activity, this number represents the number of hours the user has played the game
    - if the row corresponds to a 'purchase' activity, this number is always a 1 (and means nothing... it's a placeholder).

Note that for each user-game combo, there will either be one row (if the game has only been purchased but not played) or two rows (if the game has been both purchased and played).  For example, if I bought Portal, a row would be added to the data representing the purchase of that game by me.  When I open the game and start playing it, a second row would be added to the data representing that I have played the game along with recording my play time for that specific game.
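
The one-row-or-two-rows layout described above can be illustrated with a small sketch. The rows below are hypothetical toy values rather than the real Steam sample, and the names `toy` and `rows_per_combo` are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout described above (not the real data).
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "game_name": ["Portal", "Portal", "Dota 2", "Portal"],
        "activity": ["purchase", "play", "purchase", "purchase"],
        "hours_played_if_play": [1.0, 3.5, 1.0, 1.0],
    }
)

# Count the rows recorded for each user-game combination:
# 2 means purchased and played, 1 means purchased only.
rows_per_combo = toy.groupby(["user_id", "game_name"]).size()
print(rows_per_combo)
```

In this toy data, user 1 has two rows for Portal (bought and played) and a single row for Dota 2 (bought only).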

### 1. [1 point] Reading in the Data

First, read the steam_sample.csv file into a dataframe.  Display the first five rows and the number of observations in the data.

Be sure to note the structure of the data (especially how the first two rows relate to each other) when observing the data.

In [32]:
df = pd.read_csv("steam_sample.csv")

In [33]:
df.head()

| | user_id | game_name | activity | hours_played_if_play |
|---|---|---|---|---|
| 0 | 308653033 | Unturned | purchase | 1.0 |
| 1 | 308653033 | Unturned | play | 0.6 |
| 2 | 308653033 | theHunter | purchase | 1.0 |
| 3 | 144004384 | Dota 2 | purchase | 1.0 |
| 4 | 144004384 | Dota 2 | play | 22.0 |
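
The question also asks for the number of observations. A minimal sketch of one way to report it is given here, assuming a tiny in-memory stand-in for `steam_sample.csv` (the `csv_text` string and the name `toy` are hypothetical) so that the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for steam_sample.csv so the example is self-contained.
csv_text = """user_id,game_name,activity,hours_played_if_play
308653033,Unturned,purchase,1
308653033,Unturned,play,0.6
308653033,theHunter,purchase,1
"""
toy = pd.read_csv(io.StringIO(csv_text))

# DataFrame.shape is a (rows, columns) tuple; the first entry is the number of observations.
n_rows, n_cols = toy.shape
print(toy.head())   # first five rows (only three exist in this toy file)
print(n_rows)
```

On the real data, `df.shape[0]` or `len(df)` would give the same kind of count.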


### 2. [3 points] Preparing the Data

In order to prepare the data effectively, perform the following steps:

1. Identify any values (if any) that have been encoded in the csv to represent a **missing value**.  
2. Make sure that Python reads these missing values correctly.  
3. Report the number of rows that have missing values and the proportion of rows with missing values.
4. Drop any observations with missing values.

In [34]:
df.isna().sum()

user_id                 0
game_name               0
activity                0
hours_played_if_play    0
dtype: int64

In [35]:
df.dtypes

user_id                  int64
game_name               object
activity                object
hours_played_if_play    object
dtype: object

In [36]:
df["hours_played_if_play"].unique()

array(['1', '0.6', '22', '1028', '1008', '148', '108', '72', '36', '35',
       '32', '21', '16', '15.8', '8.6', '7.8', '7.3', '3.1', '1.9', '1.7',
       '1.1', '0.4', '153', '63', '26', '1.4', '639', '479', '70', '65',
       '33', '30', '19.8', '16.2', '11.3', '4.2', '3.9', '2.3', '0.8',
       '0.7', '0.5', '0.3', '396', '227', '13.4', '12.6', '11.2', '10.1',
       '2.4', '210', '1.2', '0.2', '13.2', '48', '110', '0.1', '67',
       '429', '5.5', '61', '1.6', '18.3', '9.9', '4.7', '1714', '441',
       '197', '147', '117', '86', '73', '49', '46', '31', '24', '20',
       '19.7', '18.2', '14.2', '11.6', '9.8', '9.7', '8.4', '6.5', '4.8',
       '4.1', '3.7', '2.9', '2.7', '2.1', '222', 'unknown', '14.9',
       '14.1', '83', '11.1', '3.2', '6.9', '395', '251', '9.3', '7.4',
       '54', '34', '1.8', '99', '98', '96', '29', '27', '23', '19.1',
       '18.7', '17.5', '17', '16.6', '14.7', '13.3', '10.7', '10.2', '10',
       '9.6', '9.5', '8', '7.9', '7.6', '7', '6.6', '6', '5.8', '5

In [37]:
df_copy = df.copy()

In [38]:
# Assign through .loc in a single step; chained indexing (df[col][mask] = ...) may write to a temporary copy
df_copy.loc[df_copy["hours_played_if_play"] == "unknown", "hours_played_if_play"] = None


In [39]:
df_copy["hours_played_if_play"].unique()

array(['1', '0.6', '22', '1028', '1008', '148', '108', '72', '36', '35',
       '32', '21', '16', '15.8', '8.6', '7.8', '7.3', '3.1', '1.9', '1.7',
       '1.1', '0.4', '153', '63', '26', '1.4', '639', '479', '70', '65',
       '33', '30', '19.8', '16.2', '11.3', '4.2', '3.9', '2.3', '0.8',
       '0.7', '0.5', '0.3', '396', '227', '13.4', '12.6', '11.2', '10.1',
       '2.4', '210', '1.2', '0.2', '13.2', '48', '110', '0.1', '67',
       '429', '5.5', '61', '1.6', '18.3', '9.9', '4.7', '1714', '441',
       '197', '147', '117', '86', '73', '49', '46', '31', '24', '20',
       '19.7', '18.2', '14.2', '11.6', '9.8', '9.7', '8.4', '6.5', '4.8',
       '4.1', '3.7', '2.9', '2.7', '2.1', '222', None, '14.9', '14.1',
       '83', '11.1', '3.2', '6.9', '395', '251', '9.3', '7.4', '54', '34',
       '1.8', '99', '98', '96', '29', '27', '23', '19.1', '18.7', '17.5',
       '17', '16.6', '14.7', '13.3', '10.7', '10.2', '10', '9.6', '9.5',
       '8', '7.9', '7.6', '7', '6.6', '6', '5.8', '5.6', 
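
As a possible alternative to replacing the sentinel by hand, the column can be coerced to numbers with `pd.to_numeric`. This is only a sketch on hypothetical values (the series `hours` is made up here), and note that `errors="coerce"` turns every unparseable string into `NaN`, not just `"unknown"`, so it is worth checking `unique()` first as done above.

```python
import pandas as pd

# Hypothetical string values shaped like the column above.
hours = pd.Series(["1", "0.6", "unknown", "22"])

# Strings that cannot be parsed as numbers become NaN; the result is a float column.
hours_num = pd.to_numeric(hours, errors="coerce")
print(hours_num)
```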

In [40]:
df = pd.read_csv("steam_sample.csv", na_values = ["unknown"])

In [41]:
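df.dropna()  # returns a new DataFrame without the missing rows; df itself is unchanged unless the result is assigned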

| | user_id | game_name | activity | hours_played_if_play |
|---|---|---|---|---|
| 0 | 308653033 | Unturned | purchase | 1.0 |
| 1 | 308653033 | Unturned | play | 0.6 |
| 2 | 308653033 | theHunter | purchase | 1.0 |
| 3 | 144004384 | Dota 2 | purchase | 1.0 |
| 4 | 144004384 | Dota 2 | play | 22.0 |
| ... | ... | ... | ... | ... |
| 7801 | 99096740 | SimCity 4 Deluxe | play | 0.2 |
| 7802 | 99096740 | BioShock Infinite Burial at Sea - Episode 2 | purchase | 1.0 |
| 7803 | 99096740 | The Elder Scrolls V Skyrim - Dawnguard | purchase | 1.0 |
| 7804 | 99096740 | The Elder Scrolls V Skyrim - Dragonborn | purchase | 1.0 |


In [42]:
df["hours_played_if_play"].unique()

array([1.0000e+00, 6.0000e-01, 2.2000e+01, 1.0280e+03, 1.0080e+03,
       1.4800e+02, 1.0800e+02, 7.2000e+01, 3.6000e+01, 3.5000e+01,
       3.2000e+01, 2.1000e+01, 1.6000e+01, 1.5800e+01, 8.6000e+00,
       7.8000e+00, 7.3000e+00, 3.1000e+00, 1.9000e+00, 1.7000e+00,
       1.1000e+00, 4.0000e-01, 1.5300e+02, 6.3000e+01, 2.6000e+01,
       1.4000e+00, 6.3900e+02, 4.7900e+02, 7.0000e+01, 6.5000e+01,
       3.3000e+01, 3.0000e+01, 1.9800e+01, 1.6200e+01, 1.1300e+01,
       4.2000e+00, 3.9000e+00, 2.3000e+00, 8.0000e-01, 7.0000e-01,
       5.0000e-01, 3.0000e-01, 3.9600e+02, 2.2700e+02, 1.3400e+01,
       1.2600e+01, 1.1200e+01, 1.0100e+01, 2.4000e+00, 2.1000e+02,
       1.2000e+00, 2.0000e-01, 1.3200e+01, 4.8000e+01, 1.1000e+02,
       1.0000e-01, 6.7000e+01, 4.2900e+02, 5.5000e+00, 6.1000e+01,
       1.6000e+00, 1.8300e+01, 9.9000e+00, 4.7000e+00, 1.7140e+03,
       4.4100e+02, 1.9700e+02, 1.4700e+02, 1.1700e+02, 8.6000e+01,
       7.3000e+01, 4.9000e+01, 4.6000e+01, 3.1000e+01, 2.4000e

In [43]:
df.dtypes

user_id                   int64
game_name                object
activity                 object
hours_played_if_play    float64
dtype: object
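
The four steps of this question (identify the sentinel, read it as missing, report the count and proportion of affected rows, then drop them) can be sketched in one place. The example assumes a hypothetical in-memory CSV, and the names `toy`, `rows_with_na`, and `toy_clean` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; "unknown" marks a missing play time.
csv_text = """user_id,game_name,activity,hours_played_if_play
1,Portal,purchase,1
1,Portal,play,unknown
2,Dota 2,purchase,1
2,Dota 2,play,22
"""

# Steps 1-2: tell pandas which string encodes a missing value.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])

# Step 3: flag rows with at least one missing value, then count them and take the proportion.
rows_with_na = toy.isna().any(axis=1)
n_missing = int(rows_with_na.sum())
prop_missing = rows_with_na.mean()
print(n_missing, prop_missing)

# Step 4: dropna() returns a new DataFrame, so the result is assigned to keep it.
toy_clean = toy.dropna()
print(len(toy_clean))
```

Assigning the result of `dropna()` matters: calling it without assignment only displays the filtered table and leaves the original DataFrame as it was.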

### 3. [1 point] Interpreting Missing Data

We dropped observations above that had any missing values.  Are you concerned that the decision to drop observations as the way to handle missing data may not have been best?  Briefly explain.

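Yes. The missing values are play times recorded as "unknown", so dropping those rows presumably removes the play record for games that were actually played, while the matching purchase rows stay in the data. Those games then look as if they were purchased but never opened, so counts and proportions of played games may come out too low.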
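
The effect of dropping rows on later proportions can be illustrated with a small sketch. It assumes hypothetical toy data in which the missing value sits in a play row, and `played_share` is a helper name invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: both games were purchased and played,
# but one play time is missing.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "game_name": ["Portal", "Portal", "Dota 2", "Dota 2"],
        "activity": ["purchase", "play", "purchase", "play"],
        "hours_played_if_play": [1.0, np.nan, 1.0, 22.0],
    }
)

def played_share(frame):
    """Number of play rows divided by number of purchase rows."""
    n_play = (frame["activity"] == "play").sum()
    n_purchase = (frame["activity"] == "purchase").sum()
    return n_play / n_purchase

share_before = played_share(toy)           # every purchase has a play row
share_after = played_share(toy.dropna())   # the play row with the missing time is gone
print(share_before, share_after)
```

In this toy case the share of purchases that look played falls after `dropna()`, even though both games were in fact played.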

### 4. [1 point] Separate the Data

In this analysis, we would like to answer questions based on purchases and based on play time.  Create two new dataframes:

- one that is comprised of the purchase rows for the games that were purchased
- one that is comprised of the play rows for the games that were played

In [44]:
df_purchase = df[df.activity == "purchase"]
df_purchase

| | user_id | game_name | activity | hours_played_if_play |
|---|---|---|---|---|
| 0 | 308653033 | Unturned | purchase | 1.0 |
| 2 | 308653033 | theHunter | purchase | 1.0 |
| 3 | 144004384 | Dota 2 | purchase | 1.0 |
| 5 | 54103616 | Counter-Strike Global Offensive | purchase | 1.0 |
| 7 | 54103616 | Counter-Strike | purchase | 1.0 |
| ... | ... | ... | ... | ... |
| 7800 | 99096740 | SimCity 4 Deluxe | purchase | 1.0 |
| 7802 | 99096740 | BioShock Infinite Burial at Sea - Episode 2 | purchase | 1.0 |
| 7803 | 99096740 | The Elder Scrolls V Skyrim - Dawnguard | purchase | 1.0 |
| 7804 | 99096740 | The Elder Scrolls V Skyrim - Dragonborn | purchase | 1.0 |


In [45]:
df_play = df[df.activity == "play"]
df_play

| | user_id | game_name | activity | hours_played_if_play |
|---|---|---|---|---|
| 1 | 308653033 | Unturned | play | 0.6 |
| 4 | 144004384 | Dota 2 | play | 22.0 |
| 6 | 54103616 | Counter-Strike Global Offensive | play | 1028.0 |
| 8 | 54103616 | Counter-Strike | play | 1008.0 |
| 10 | 54103616 | Left 4 Dead | play | 148.0 |
| ... | ... | ... | ... | ... |
| 7793 | 99096740 | Crysis | play | 5.3 |
| 7795 | 99096740 | Assassin's Creed II | play | 2.7 |
| 7797 | 99096740 | Hitman Blood Money | play | 1.3 |
| 7799 | 99096740 | The Binding of Isaac Rebirth | play | 0.7 |
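
The split into purchase and play rows can also be written with explicit boolean masks and `.copy()`, which keeps the two results independent of the original DataFrame. This is a sketch on hypothetical toy rows, and the names `toy_purchase` and `toy_play` are made up here.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as the Steam data.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "game_name": ["Portal", "Portal", "Dota 2", "Unturned"],
        "activity": ["purchase", "play", "purchase", "purchase"],
        "hours_played_if_play": [1.0, 3.5, 1.0, 1.0],
    }
)

# Boolean masks select the rows; .copy() makes each result its own DataFrame.
toy_purchase = toy[toy["activity"] == "purchase"].copy()
toy_play = toy[toy["activity"] == "play"].copy()

# Sanity check: every row lands in exactly one of the two pieces.
print(len(toy_purchase) + len(toy_play) == len(toy))
```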


### 5. [2 points] Game Play Time Question

Review the amount of time played for each row (user-game combination); you can choose your favorite method to review these amounts.  Are there any of these values that you suspect may not be accurate?  Explain.  

Do you have any questions about the game play time variable?

In [46]:
purchase_mean = df_purchase["hours_played_if_play"].mean()
purchase_mean

1.0

In [47]:
play_mean = df_play["hours_played_if_play"].mean()
play_mean

56.7249651810585

We think some of these values are not accurate as play times. Every purchase row shows 1.0 hours, which is only a placeholder: it makes no sense for someone who has just bought a game and not started it to show 1.0 hours while someone who has already played shows 0.6 hours. The mean of 1.0 for the purchase rows therefore says nothing about play time, and only the play rows (mean of about 56.7 hours) describe actual hours played.
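
Besides the mean, one way to review the play times is a numeric summary together with the largest values, since extreme totals are the ones most worth questioning. The sketch below assumes a hypothetical toy version of the play rows (the name `toy_play` and the row assignments are made up).

```python
import pandas as pd

# Hypothetical toy play rows; the hours are illustrative only.
toy_play = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "game_name": ["Unturned", "Dota 2", "Dota 2", "Portal"],
        "hours_played_if_play": [0.6, 22.0, 1714.0, 3.1],
    }
)

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = toy_play["hours_played_if_play"].describe()
print(summary)

# nlargest() lists the rows with the highest play times for a closer look.
top = toy_play.nlargest(2, "hours_played_if_play")
print(top)
```

A histogram or box plot of the same column would be another reasonable way to spot unusual values.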

### 6. [2 points] Purchasing to Playing?

Overall, of all of the purchases represented in this data, what proportion have been played?  Of all the purchases in the data, what proportion remain "unopened", that is, unplayed?

What factors do you anticipate might affect whether a given game is played or not after being purchased?

In [48]:
len(df_play) / len(df) * 100  # percentage of ALL rows that are play rows (the denominator is every row, not only purchases)

36.83064309505509

In [49]:
len(df_purchase) / len(df) * 100  # percentage of ALL rows that are purchase rows

63.16935690494492
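
Because the question asks for proportions of purchases, one reading is that the denominator should be the number of purchase rows rather than the number of all rows. The sketch below shows that calculation on hypothetical toy data, assuming every play row has a matching purchase row; the names `prop_played` and `prop_unplayed` are made up. The last lines apply the same ratio to the two row percentages reported above.

```python
import pandas as pd

# Hypothetical toy data: 4 purchases, 3 of which have a matching play row.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2, 2, 2],
        "game_name": ["Portal", "Portal", "Dota 2", "Dota 2",
                      "Portal", "Portal", "Unturned"],
        "activity": ["purchase", "play", "purchase", "play",
                     "purchase", "play", "purchase"],
    }
)

n_purchase = (toy["activity"] == "purchase").sum()
n_play = (toy["activity"] == "play").sum()

# Proportion of purchases that were played, and the remainder left unplayed.
prop_played = n_play / n_purchase
prop_unplayed = 1 - prop_played
print(prop_played, prop_unplayed)

# Same ratio from the row percentages reported above (play share / purchase share).
ratio_from_reported = 36.83064309505509 / 63.16935690494492
print(round(ratio_from_reported, 3))
```

Under that assumption, the reported percentages would correspond to roughly 58% of purchases having been played and roughly 42% remaining unplayed.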

One factor is price: a game may be on sale at a very good price, so someone buys it ahead of time to lock in the deal, knowing they can play it whenever they want.