<a href="https://colab.research.google.com/github/pschorey/Valpo_IT533/blob/main/01_Processing_and_Association.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pedagogy Lab 1 (Exploring Chi Square: Data Processing and Association)

>You will be working with the [STEAM Games dataset](https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/steam_games.csv), a cleaned and modified subset of this dataset on [Kaggle](https://www.kaggle.com/datasets/tristan581/all-55000-games-on-steam-november-2022?resource=download&select=steam_games.csv). This is a PARTIAL inventory of the online PC gaming store STEAM. Parts of it were collected on the 8th of November 2022, using the STEAM store API as well as the API of another website, steamspy.com. It contains useful information such as each product's genre, tags, categories, and languages.
>
> The features of this dataset are:
* AppID: ID of product as allocated by Steam.
* Name: Product name
* Developer: Whoever created the product.
* Publisher: Whoever published the product.
* Genre: The genre(s) that the product is in.
* Categories: The categories/features that the product has.
* Owners: An approximate number of owners, according to Steam Spy.
* Positive_Reviews: The number of positive reviews the product has.
* Negative_Reviews: The number of negative reviews the product has.
* Price_\$: The price of the game in USD.
* Initial_Price_\$: The price of the game in USD at launch.
* Discount_%: What percentage sale the product was off by as of 2022/11/8.
* Peak_Concurrent_Players: Peak concurrent players as of 2022/11/8.
* Platforms: What operating systems the product is available on.
* Release Date: When the product was first released.
* Required Age: Whether the user needs to be over a certain age to legally purchase (0 = no restrictions)
* Avg_Rating_5: Average rating from 1 to 5 stars
* Playability_index_10: Average playability index on a scale from 1-10

* Source: Professor Sonja Streuber 2023S1R-IT-533-STEM:Data Mining and Applications

In [14]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

steam = pd.read_csv("https://raw.githubusercontent.com/pschorey/Valpo_IT533/main/steam_games_short.csv")


# Inspect the data


In [15]:
steam.head()

Unnamed: 0,AppID,Name,Developer,Publisher,Genre,Categories,Owners,Positive_Reviews,Negative_Reviews,Price_USD,Initial_Price_USD,Discount_Percent,Peak_Concurrent_Players,Platforms,Release_Date,Required_Age,Avg_Rating_5,Playability_Index_10
0,10,Counter-Strike,Valve,Valve,Action,"Multi-player, Valve Anti-Cheat enabled, Online...","10,000,000 .. 20,000,000",201215,5199,9.99,9.99,0,13990,"windows, mac, linux",11/1/2000,0,3.87,6.54
1,20,Team Fortress Classic,Valve,Valve,Action,"Multi-player, Valve Anti-Cheat enabled, Online...","1,000,000 .. 2,000,000",5835,934,4.99,4.99,0,101,"windows, mac, linux",4/1/1999,0,3.92,8.15
2,30,Day of Defeat,Valve,Valve,Action,"Multi-player, Valve Anti-Cheat enabled","5,000,000 .. 10,000,000",5251,569,4.99,4.99,0,142,"windows, mac, linux",5/1/2003,0,4.9,7.25
3,40,Deathmatch Classic,Valve,Valve,Action,"Multi-player, Valve Anti-Cheat enabled, Online...","5,000,000 .. 10,000,000",1961,437,4.99,4.99,0,3,"windows, mac, linux",6/1/2001,0,4.69,8.12
4,50,Half-Life: Opposing Force,Gearbox Software,Valve,Action,"Multi-player, Single-player, Valve Anti-Cheat ...","5,000,000 .. 10,000,000",14887,756,4.99,4.99,0,106,"windows, mac, linux",11/1/1999,0,2.84,2.35


# Describe the data

In [16]:
steam.describe()

Unnamed: 0,AppID,Positive_Reviews,Negative_Reviews,Price_USD,Initial_Price_USD,Discount_Percent,Peak_Concurrent_Players,Required_Age,Avg_Rating_5,Playability_Index_10
count,11126.0,11126.0,11126.0,11126.0,11126.0,11126.0,11126.0,11126.0,11126.0,11126.0
mean,511425.4,4599.697,655.332914,8.549563,8.865782,3.125742,378.9032,0.385853,3.508388,5.006447
std,561458.7,66290.48,9050.343291,9.223482,9.293126,14.63596,12008.136002,2.444761,0.864208,2.88624
min,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
25%,281297.5,22.0,9.0,2.99,2.99,0.0,0.0,0.0,2.76,2.51
50%,365835.0,110.5,43.0,5.99,6.99,0.0,0.0,0.0,3.52,5.05
75%,448447.5,619.75,173.0,9.99,9.99,0.0,4.0,0.0,4.26,7.48
max,2190950.0,5943345.0,787093.0,299.9,299.9,90.0,874053.0,18.0,5.0,10.0


# Chi-Square

The Chi Square test compares the Observed counts for each combination of two CATEGORICAL attributes with the counts we would Expect if the two attributes were independent.

**THINGS TO REMEMBER** about a Chi Square test:
1. It is a hypothesis test based on categorical attributes.
2. Its H0 is that the two variables under investigation are independent.
3. It uses a chi square table.
4. If the resulting p-value is > 0.05, we fail to reject H0: the data show no significant evidence that the variables are dependent.
5. If the resulting p-value is < 0.05, we reject H0 and conclude that the variables are dependent.

* Source: Professor Sonja Streuber 2023S1R-IT-533-STEM:Data Mining and Applications
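
To make the Observed versus Expected comparison concrete, here is a minimal sketch that computes the statistic by hand and cross-checks it against `chi2_contingency`. The counts in `observed` are invented for illustration and do not come from the Steam data.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts (not from the Steam data):
# rows = two groups of games, columns = three rating categories
observed = np.array([[30, 50, 20],
                     [20, 40, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected count per cell if the two attributes were independent:
# (row total * column total) / grand total
expected = row_totals @ col_totals / grand_total

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)  # upper-tail probability

# Cross-check against SciPy's implementation
chi2_ref, p_ref, dof_ref, expected_ref = chi2_contingency(observed, correction=False)

print(f'manual: chi2={chi2_stat:.4f}, p={p_value:.4f}, dof={dof}')
print(f'scipy:  chi2={chi2_ref:.4f}, p={p_ref:.4f}, dof={dof_ref}')
```

In this made-up table the p-value is below 0.05, so we would reject H0 for these counts.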



# Looking at our Steam data there aren't many categories!
We are going to create a category column based on average rating:


*   rating >= 4 : "Excellent"
*   3 <= rating < 4 : "Good"
*   2 <= rating < 3 : "Average"
*   rating < 2 : "Bad"



In [17]:
# Bin edges: np.digitize returns 1 for [0, 2), 2 for [2, 3), 3 for [3, 4), and 4 for >= 4
bins = [0, 2, 3, 4]
labels = ['Bad', 'Average', 'Good', 'Excellent']
# Map each bin number (starting at 1) to its label
d = dict(enumerate(labels, 1))
steam['Rating_Labels'] = np.vectorize(d.get)(np.digitize(steam['Avg_Rating_5'], bins))
# Above code adapted from https://stackoverflow.com/questions/49382207/how-to-map-numeric-data-into-categories-bins-in-pandas-dataframe

#check our data to see if it worked...
steam[['Rating_Labels', 'Avg_Rating_5']]


Unnamed: 0,Rating_Labels,Avg_Rating_5
0,Good,3.87
1,Good,3.92
2,Excellent,4.90
3,Excellent,4.69
4,Average,2.84
...,...,...
11121,Excellent,4.06
11122,Average,2.87
11123,Good,3.79
11124,Average,2.89
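
As an alternative to the `np.digitize` approach above, the same binning can be sketched with `pd.cut`. This is not the method used in this lab. The `ratings` series is a hypothetical stand-in for `steam['Avg_Rating_5']`, and the extra `np.inf` edge is an assumption added so that every rating of 4 or more lands in the last bin.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for steam['Avg_Rating_5']
ratings = pd.Series([1.5, 2.0, 2.84, 3.0, 3.87, 4.0, 4.9])

labels = ['Bad', 'Average', 'Good', 'Excellent']

# right=False makes each bin include its lower edge:
# [0, 2) -> Bad, [2, 3) -> Average, [3, 4) -> Good, [4, inf) -> Excellent
rating_labels = pd.cut(ratings, bins=[0, 2, 3, 4, np.inf], labels=labels, right=False)

print(pd.DataFrame({'Avg_Rating_5': ratings, 'Rating_Labels': rating_labels}))
```

`pd.cut` returns an ordered categorical column, which keeps the labels in rating order when they are later used in `pd.crosstab`.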


Using our Steam data, let's see if there is a relation between Rating_Labels and Genre.

Our hypothesis is that game genre and rating label are dependent.

The null hypothesis is that game genre and rating label are independent.

Which statement is likely to be true?

In [18]:
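# Build the contingency table of rating labels (rows) by genre (columns)
contingency = pd.crosstab(steam['Rating_Labels'], steam['Genre'])
chi2, p, dof, expected = chi2_contingency(contingency.values)
print(f'Chi-square Statistic: {chi2} ,p-value: {p}, Degrees of Freedom: {dof}')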

Chi-square Statistic: 1411.890703428438 ,p-value: 0.48080185981389695, Degrees of Freedom: 1410


Our p-value is 0.481, which is > 0.05. This means we fail to reject the null hypothesis: this test finds no significant evidence of an association between rating labels and game genre. How disappointing!
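
The 1410 degrees of freedom reported above mean the contingency table has a very large number of cells. A common rule of thumb says the chi-square approximation is less reliable when many expected counts fall below 5, so it may be worth inspecting the `expected` array that `chi2_contingency` returns. The sketch below uses a tiny invented `toy` DataFrame in place of `steam`; its values are assumptions for illustration only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for steam[['Rating_Labels', 'Genre']]
toy = pd.DataFrame({
    'Rating_Labels': ['Good', 'Good', 'Excellent', 'Average',
                      'Good', 'Excellent', 'Average', 'Good'],
    'Genre': ['Action', 'Action', 'Indie', 'Action',
              'Indie', 'Action, Indie', 'Indie', 'Action'],
})

table = pd.crosstab(toy['Rating_Labels'], toy['Genre'])
chi2_stat, p, dof, expected = chi2_contingency(table.values)

# Fraction of cells whose expected count is below 5
low_share = (expected < 5).mean()

print(f'table shape: {table.shape}, degrees of freedom: {dof}')
print(f'share of cells with expected count < 5: {low_share:.2f}')
```

Note that a combined value such as `'Action, Indie'` counts as its own genre column. If the share of low expected counts is high, grouping rare categories before testing is one option to consider.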

Maybe you can find a better association?

# Now it's your turn!
Create a hypothesis about two categorical attributes from the Steam data and write it here:

**My Hypothesis is that....**

In [None]:
#Test your hypothesis using chi2_contingency here



Summarize the results here:

**The results from my test indicate that....**