### Exploratory Data Analysis on Steam Reviews

* [Import Libraries](#section-1)
* [Read Data](#section-2)
* [Exploratory Data Analysis](#section-3)
    - [Summary Statistics, Missing values, Duplicates, etc.](#subsection-1)
    - [Plots](#subsection-2)
* [Word Clouds](#section-4)
 


<a id="section-1"></a>
### Import Libraries

In [82]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud 

<a id="section-2"></a>
### Read Data

In [21]:
df_reviews = pd.read_csv('/kaggle/input/steam-reviews/dataset.csv')
df_reviews.head()

In [4]:
# size of dataframe
df_reviews.shape

<a id="section-3"></a>
### Exploratory Data Analysis

<a id="subsection-1"></a>
#### Summary Statistics

In [7]:
df_reviews.describe(include='all')

#### Missing values

In [9]:
# are there any missing values?
df_reviews.isnull().values.any()

In [10]:
# count of missing values by column
df_reviews.isnull().sum()

In [105]:
# percentage of missing values by column
(df_reviews.isnull().mean() * 100).round(2)
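
The count and percentage views can also be combined into one table. The following is a minimal sketch on a toy frame, not part of the original notebook: `missing_summary` is a hypothetical helper, and the toy columns only mimic the dataset's column names.

```python
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, worst column first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# toy data standing in for df_reviews (assumed column names)
toy = pd.DataFrame({
    "app_name": ["A", "B", None, "D"],
    "review_text": ["good", None, None, "bad"],
    "review_score": [1, -1, 1, -1],
})

summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(df_reviews)`.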

#### Duplicates

In [31]:
# how many duplicates?
df_reviews.duplicated().sum()

In [32]:
# drop the duplicates
df_reviews = df_reviews.drop_duplicates(keep='first')

In [33]:
# new DF size
df_reviews.shape
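
To make the `keep='first'` behaviour concrete, here is a small sketch on made-up rows (the toy values are assumptions, not taken from the dataset). It shows that only the repeats are flagged and dropped, while the first occurrence of each row survives.

```python
import pandas as pd

toy = pd.DataFrame({
    "app_name": ["A", "A", "B", "A"],
    "review_text": ["fun", "fun", "meh", "fun"],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged
dup_mask = toy.duplicated(keep="first")
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps exactly the rows that were not flagged
# (the index is reset here only to make the toy output easier to read)
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped)
```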

#### Total number of games

In [47]:
df_reviews["app_name"].nunique()

#### Games with most reviews

In [46]:
(
    df_reviews.groupby("app_name", as_index=False)["review_score"]
    .count()
    .rename(columns={'review_score': 'Review Count'})
    .sort_values(by=["Review Count"], ascending=False)
    .head(10)
)

<a id="subsection-2"></a>
#### Plots

In [37]:
# Review Score
sns.countplot(x="review_score", data=df_reviews).set_title('Review Score')
plt.ticklabel_format(style='plain', axis='y')
plt.show()

In [38]:
# Review Votes
sns.countplot(x="review_votes", data=df_reviews).set_title('Review Votes')
plt.ticklabel_format(style='plain', axis='y')
plt.show()

#### Number of reviews per game

In [42]:
df_reviews.groupby("app_name", as_index=False)["review_score"].count().rename(columns={'review_score':'Review Count'}).plot(figsize=(14,6))
plt.show()

#### Getting the percentage of positive and negative reviews for each game

In [54]:
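# count reviews per (game, score), then divide by each game's total
# (transform keeps the original index, so reset_index() works on recent pandas versions too)
counts = df_reviews.groupby(["app_name", "review_score"])["review_text"].count()
totals = counts.groupby(level="app_name").transform("sum")
df_reviews_perc = (100 * (counts / totals).round(2)).rename("Percentage").reset_index()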

In [57]:
# percentage of positive reviews
sns.displot(df_reviews_perc[df_reviews_perc["review_score"]==1]["Percentage"], kde=False, bins=10)
plt.title('Percentage of Positive Reviews')
plt.show()

In [59]:
# percentage of negative reviews
sns.displot(df_reviews_perc[df_reviews_perc["review_score"]==-1]["Percentage"], kde=False, bins=10)
plt.title('Percentage of Negative Reviews')
plt.show()

#### Games with at most 10% positive reviews

In [69]:
df_reviews_perc[(df_reviews_perc["review_score"]==1) & (df_reviews_perc["Percentage"] <= 10)].sort_values(by=["Percentage"], ascending=True).head(10)

#### Remember that the percentages are based on different numbers of reviews

In [74]:
df_reviews[(df_reviews["app_name"]=='Fray') | (df_reviews["app_name"]=='Drunk Wizards')].groupby("app_name", as_index=False)["review_score"].count().rename(columns={'review_score':'Review Count'})
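
One way to account for this is to rank games only once they have a minimum number of reviews. The sketch below uses toy data; `MIN_REVIEWS` is a hypothetical cut-off chosen purely for illustration, and the toy game names are made up. Computing the positive share directly from `review_score` also keeps games that have no positive reviews at all, which have no `review_score == 1` row in a per-score table.

```python
import pandas as pd

MIN_REVIEWS = 3  # hypothetical threshold, pick one that suits the data

toy = pd.DataFrame({
    "app_name": ["Small"] * 2 + ["Big"] * 5,
    "review_score": [-1, -1, 1, 1, 1, -1, 1],
})

# per-game review count and share of positive reviews
stats = (
    toy.groupby("app_name")["review_score"]
    .agg(review_count="count", positive_pct=lambda s: 100 * (s == 1).mean())
    .reset_index()
)

# keep only games with enough reviews for the percentage to be meaningful
reliable = stats[stats["review_count"] >= MIN_REVIEWS]

print(stats)
print(reliable)
```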

In [64]:
# Number of Characters in each Review
sns.displot(df_reviews["review_text"].str.len(), kde=False, bins=15)
plt.ticklabel_format(style='plain', axis='y')
plt.title('Number of Characters in each Review')
plt.show()

#### Average number of characters per review

In [66]:
round(df_reviews["review_text"].str.len().mean())
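
A single mean can be pulled upward by a few very long reviews, so it may help to look at the median as well. This is a minimal sketch on a made-up series (the toy strings are assumptions). Note that `.str.len()` returns NaN for missing text, and both `mean()` and `median()` skip those values.

```python
import pandas as pd

toy_text = pd.Series(["ok", "great game", None, "x" * 100])

lengths = toy_text.str.len()      # NaN where the review text is missing
mean_len = lengths.mean()         # sensitive to the one very long review
median_len = lengths.median()     # robust to the long outlier

print(round(mean_len), median_len)
print(lengths.describe())
```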

<a id="section-4"></a>
### Word Clouds

In [76]:
# convert review texts to string (missing reviews become empty strings rather than the word "nan")
df_reviews["review_text"] = df_reviews["review_text"].fillna("").astype(str)

In [99]:
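# define the word cloud function
from wordcloud import STOPWORDS  # stop-word set shipped with the wordcloud package

def WordCloud_generator(data, title=None):

    # Count individual words (not whole reviews), ignoring case,
    # leading/trailing punctuation and stop words
    words = (w.strip('.,!?;:"()[]') for review in data for w in str(review).lower().split())
    counts = Counter(w for w in words if w and w not in STOPWORDS)

    # Keep top 1000 most frequent words
    most_freq = dict(counts.most_common(1000))

    # max_words must be raised from its default of 200 to draw all 1000 words
    wordcloud = WordCloud(width = 800, height = 800,
                          background_color ='white',
                          min_font_size = 10,
                          max_words = 1000
                         ).generate_from_frequencies(most_freq)

    # plot the Word Cloud
    plt.figure(figsize = (6, 6), facecolor = None)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad = 0)
    plt.title(title,fontsize=25)
    plt.show()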

In [94]:
# Most used words in reviews
WordCloud_generator(df_reviews["review_text"], title="Most used words in reviews")

In [96]:
# Most used words in positive reviews
WordCloud_generator(df_reviews[df_reviews["review_score"]==1]["review_text"], title="Most used words in Positive reviews")

In [97]:
# Most used words in negative reviews
WordCloud_generator(df_reviews[df_reviews["review_score"]==-1]["review_text"], title="Most used words in Negative reviews")

In [100]:
# Most used words in recommended reviews
WordCloud_generator(df_reviews[df_reviews["review_votes"]==1]["review_text"], title="Most used words in recommended reviews")
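
A word cloud is driven by per-word frequencies, so it can be useful to inspect those counts directly before plotting. The following is a minimal sketch using only the standard library: `top_words`, its tiny stop-word set, and the sample reviews are all hypothetical and only illustrate the counting step.

```python
import re
from collections import Counter

def top_words(reviews, n=2, stopwords=frozenset({"the", "is", "a", "this"})):
    """Return the n most frequent words across all reviews, ignoring stop words."""
    counts = Counter()
    for review in reviews:
        # lowercase, then keep runs of letters/apostrophes as words
        words = re.findall(r"[a-z']+", str(review).lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)

sample = [
    "This game is great",
    "Great story, great music",
    "The story is boring",
]

top = top_words(sample)
print(top)
```

On the real data, a call such as `top_words(df_reviews["review_text"], n=20)` would list the words that dominate the corresponding cloud.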