### LSE Data Analytics Online Career Accelerator 

# TURTLE GAMES

### Questions to answer:
Turtle Games is a game manufacturer and retailer. It manufactures and sells its own products and also sources and sells products manufactured by other companies. Its product range includes books, board games, video games and toys. The company has a global customer base and a business objective of improving overall sales performance by utilising customer trends. In particular, Turtle Games wants to understand:
- how customers accumulate loyalty points
- how useful the remuneration and spending score data are
- whether social data (e.g. customer reviews) can be used in marketing campaigns
- what the impact on sales per product is
- how reliable the data is (e.g. normal distribution, skewness, kurtosis)
- whether there is any relationship between North American, European and global sales.

## How do customers accumulate loyalty points?

The marketing department of Turtle Games prefers Python for data analysis. Some analysis of social media data has already been carried out. The marketing department now wants to better understand how users accumulate loyalty points. We therefore investigate possible relationships between loyalty points and age, remuneration, and spending score.

## Instructions
1. Load and explore the data.
    1. Create a new DataFrame (e.g. reviews).
    2. Sense-check the DataFrame.
    3. Determine if there are any missing values in the DataFrame.
    4. Create a summary of the descriptive statistics.
2. Remove redundant columns (`language` and `platform`).
3. Change column headings to names that are easier to reference (e.g. `remuneration` and `spending_score`).
4. Save a copy of the clean DataFrame as a CSV file. Import the file to sense-check.
5. Use linear regression and the `statsmodels` functions to evaluate possible linear relationships between loyalty points and age/remuneration/spending scores to determine whether these can be used to predict the loyalty points.
    1. Specify the independent and dependent variables.
    2. Create the OLS model.
    3. Extract the estimated parameters, standard errors, and predicted values.
    4. Generate the regression table based on the X coefficient and constant values.
    5. Plot the linear regression and add a regression line.
6. Include your insights and observations.

## 1. Load and explore the data

In [None]:
# Imports libraries needed for analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm 
from statsmodels.formula.api import ols

In [None]:
# Load the turtle_reviews.csv as reviews dataframe.
reviews = pd.read_csv('turtle_reviews.csv')

# View the dataframe.
reviews.head()

In [None]:
# Check the size of the dataframe.
reviews.shape

In [None]:
# Check whether there are any missing values in the dataframe.
reviews.isnull().sum()

In [None]:
# Explore the data.
reviews.info()

In [None]:
# Explore descriptive statistics for numeric data.
reviews.describe()

## 2. Drop columns

In [None]:
# Remove redundant columns (language and platform).
reviews = reviews.drop(['language', 'platform'], axis=1)

# View column names and check the shape of the dataframe.
print(reviews.shape)
reviews.head()

## 3. Rename columns

In [None]:
# Renaming the columns ("remuneration (k£)" and "spending_score (1-100)") headers.
reviews.rename(columns={"remuneration (k£)":"remuneration", "spending_score (1-100)":"spending_score"}, inplace=True)

# View column names.
reviews.info()

## 4. Save the DataFrame as a CSV file

In [None]:
# Create a CSV file as output.
reviews.to_csv("reviews.csv", index=False)

In [None]:
# Import new CSV file with Pandas.
reviews = pd.read_csv('reviews.csv')

# View the DataFrame.
reviews.head()

In [None]:
# Sense check the dataframe.
reviews.info()

## 5. Linear regression

Let's evaluate possible linear relationships between loyalty points and spending score, remuneration and age to determine whether these can be used to predict loyalty points.

### 5a) spending vs loyalty

In [None]:
# Dependent variable (y) is loyalty points.
y = reviews['loyalty_points'] 

# Independent variable (x) is spending score.
x = reviews['spending_score']

# Create formula and pass through OLS methods.
f = 'y ~ x'
test = ols(f, data = reviews).fit()

# Print the regression table.
test.summary() 

R²: 45.2% of the total variability of y (the customers' loyalty points) is explained by the variability of X (their spending score). X: the coefficient of X is the slope of the regression line, i.e. how much the response variable y changes when X changes by 1 unit. So, if a customer's spending score (X) increases by 1 unit, the predicted loyalty points (y) increase by 33.0617 units. The t-value tests the null hypothesis that the slope is zero. If the corresponding probability (p-value) is small (typically smaller than 0.05), the slope is statistically significant. In this case, the p-value is reported as 0.000, so the estimated slope is significant. The last two numbers describe the 95% confidence interval for the true X coefficient, i.e. the true slope: (31.464, 34.659). If you take a different sample, the estimated slope and its interval will be slightly different; across many repeated random samples, about 95% of the intervals constructed in this way would contain the true slope.
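
As a rough illustration of what the slope, intercept and R² in the regression table measure, the sketch below computes them by hand for five made-up points. The data and variable names are hypothetical and are not taken from `turtle_reviews.csv`; `statsmodels` reports the same kind of quantities for the real data.

```python
# Toy data (hypothetical, not from the Turtle Games data set).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means.
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares estimates of the slope and the constant.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# R-squared: share of the variability in y explained by x.
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R2={r_squared:.4f}")
```

The slope answers "how much does y change per unit of x", while R² answers "how much of the spread in y does the line account for"; a model can have a clearly non-zero slope and still a modest R².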

In [None]:
# Extract the estimated parameters.
print("Parameters: ", test.params) 

# Extract the standard errors.
print("Standard errors: ", test.bse)  

# Extract the predicted values.
print("Predicted values: ", test.predict()) 

In [None]:
# Set the X coefficient and the constant to generate the regression table.
y_pred = -75.052663 + 33.061693 * reviews['spending_score']

# View the output.
y_pred

In [None]:
# Plot the data points with a scatterplot.
plt.scatter(x, y)

# Plot the regression line (in red).
plt.plot(x, y_pred, color='red')

# Set the x and y limits on the axes.
plt.xlim(0)
plt.ylim(0)

# View the plot.
plt.show()

### 5b) remuneration vs loyalty

In [None]:
# Dependent variable (y) is loyalty points.
y = reviews['loyalty_points'] 

# Independent variable (x) is remuneration.
x = reviews['remuneration']

# Create formula and pass through OLS methods.
f = 'y ~ x'
test = ols(f, data = reviews).fit()

# Print the regression table.
test.summary() 

R²: 38.0% of the total variability of y (the customers' loyalty points) is explained by the variability of X (remuneration). X: the coefficient of X is the slope of the regression line, i.e. how much the response variable y changes when X changes by 1 unit. So, if a customer's remuneration (X) increases by 1 unit, the predicted loyalty points (y) increase by 34.1878 units. The t-value tests the null hypothesis that the slope is zero. If the corresponding probability (p-value) is small (typically smaller than 0.05), the slope is statistically significant. In this case, the p-value is reported as 0.000, so the estimated slope is significant. The last two numbers describe the 95% confidence interval for the true X coefficient, i.e. the true slope: (32.270, 36.106). If you take a different sample, the estimated slope and its interval will be slightly different; across many repeated random samples, about 95% of the intervals constructed in this way would contain the true slope.

In [None]:
# Extract the estimated parameters.
print("Parameters: ", test.params) 

# Extract the standard errors.
print("Standard errors: ", test.bse)  

# Extract the predicted values.
print("Predicted values: ", test.predict()) 

In [None]:
# Set the X coefficient and the constant to generate the regression table.
y_pred = -65.686513 + 34.187825 * reviews['remuneration']

# View the output.
y_pred

In [None]:
# Plot the data points with a scatterplot.
plt.scatter(x, y)

# Plot the regression line (in red).
plt.plot(x, y_pred, color='red')

# Set the x and y limits on the axes.
plt.xlim(0)
plt.ylim(0)

# View the plot.
plt.show()

### 5c) age vs loyalty

In [None]:
# Dependent variable (y) is loyalty points.
y = reviews['loyalty_points'] 

# Independent variable (x) is age.
x = reviews['age']

# Create formula and pass through OLS methods.
f = 'y ~ x'
test = ols(f, data = reviews).fit()

# Print the regression table.
test.summary() 

R²: Only 2.0% of the total variability of y (the customers' loyalty points) is explained by the variability of X (customer age). X: the coefficient of X is the slope of the regression line, i.e. how much the response variable y changes when X changes by 1 unit. So, if a customer's age (X) increases by 1 unit, the predicted loyalty points (y) change by -4.0128 units. The t-value tests the null hypothesis that the slope is zero. In this case, the p-value of the t-test is 0.058 (greater than 0.05), so the estimated slope is not statistically significant at the 5% level.
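
The distinction between the t-value and its p-value can be sketched as follows. This is only an illustration: the t-statistics are hypothetical, and the helper `two_sided_p_value` uses the normal approximation to the t distribution, which is an assumption that is reasonable for large samples such as the 2000 observations here (`statsmodels` itself uses the t distribution).

```python
from statistics import NormalDist

def two_sided_p_value(t_stat: float) -> float:
    """Two-sided p-value using the normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Hypothetical t-statistics, chosen only for illustration.
for t_stat in (-1.9, -2.5, 40.0):
    p = two_sided_p_value(t_stat)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"t = {t_stat:6.1f}  p = {p:.4f}  ({verdict} at the 5% level)")
```

The decision is always made on the p-value: a t-statistic close to zero gives a large p-value, and a t-statistic far from zero in either direction gives a small one.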

In [None]:
# Extract the estimated parameters.
print("Parameters: ", test.params) 

# Extract the standard errors.
print("Standard errors: ", test.bse)  

# Extract the predicted values.
print("Predicted values: ", test.predict()) 

In [None]:
# Set the X coefficient and the constant to generate the regression table.
y_pred = 1736.517739 - 4.012805 * reviews['age']

# View the output.
y_pred

In [None]:
# Plot the data points with a scatterplot.
plt.scatter(x, y)

# Plot the regression line (in red).
plt.plot(x, y_pred, color='red')

# Set the x and y limits on the axes.
plt.xlim(0)
plt.ylim(0)

# View the plot.
plt.show()

### 5d) Multiple Linear Regression (MLR): remuneration and spending score vs loyalty. 

In [None]:
# Import all the necessary packages.
import statsmodels.stats.api as sms
import sklearn
from sklearn import linear_model
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression

import warnings  
warnings.filterwarnings('ignore')  

In [None]:
# Define the dependent variable y loyalty points.
y = reviews['loyalty_points']  

# Define the independent variables X.
X = reviews[['remuneration', 'spending_score']]

In [None]:
# Fit the regression model.
multi = LinearRegression()  
multi.fit(X, y)

In [None]:
# Call the predictions for X (array).
multi.predict(X)

In [None]:
# Checking the value of R-squared, intercept and coefficients in the model.
print("R-squared: ", multi.score(X, y))
print("Intercept: ", multi.intercept_)
print("Coefficients:")

list(zip(X, multi.coef_))

In [None]:
# Make predictions. Create the variables new_remuneration and new_spending_score.
# The model has two predictors only, so no age value is needed.
new_remuneration = 22.96
new_spending_score = 61
print('Predicted Value: \n', multi.predict([[new_remuneration, new_spending_score]]))

In [None]:
# Train and test subsets with multiple linear regression (MLR).
# Split the data into 'train' (80%) and 'test' (20%) sets.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20)

In [None]:
# Train the model using the 'statsmodels' OLS class.
# Fit the model with the added constant.
model = sm.OLS(Y_train, sm.add_constant(X_train)).fit()

# Set the predicted response vector.
Y_pred = model.predict(sm.add_constant(X_test)) 

# Call a summary of the model.
print_model = model.summary()

# Print the summary.
print(print_model)  

In [None]:
# Specify the model.
mlr = LinearRegression()  

# Fit the model. Use training data set.
mlr.fit(X_train, Y_train)

In [None]:
# Call the predictions for X in the train set.
y_pred_mlr = mlr.predict(X_train)  

# Print the predictions.
print("Prediction for train set: {}".format(y_pred_mlr)) 

In [None]:
# Call the predictions for X in the test set.
y_pred_mlr = mlr.predict(X_test)  

# Print the predictions.
print("Prediction for test set: {}".format(y_pred_mlr)) 

In [None]:
# Print the R-squared value.
print(mlr.score(X_test, Y_test)*100)  

In [None]:
# Check for multicollinearity.
x_temp = sm.add_constant(X_train)

# Create an empty DataFrame. 
vif = pd.DataFrame()

# Calculate the VIF for each value.
vif['VIF Factor'] = [variance_inflation_factor(x_temp.values,
                                               i) for i in range(x_temp.values.shape[1])]

# Create the feature columns.
vif['features'] = x_temp.columns

# Print the values to one decimal place.
print(vif.round(1))

The VIF values for both predictors are close to 1, so multicollinearity is not a problem here.
No correlation, or only a very weak correlation, was found between the predictors.
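
To see why a VIF close to 1 implies weak correlation, note that with exactly two predictors the VIF of each equals 1 / (1 - r²), where r is the Pearson correlation between them. The sketch below works this through on made-up values; the helper names and the data are hypothetical.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1 / (1 - r ** 2)

# Hypothetical predictor values (not the Turtle Games data).
remuneration = [12, 25, 38, 51, 64, 77]
spending_score = [40, 75, 20, 60, 35, 55]

print("weakly correlated:  ", round(vif_two_predictors(remuneration, spending_score), 3))
print("strongly correlated:", round(vif_two_predictors([1, 2, 3, 4], [1, 2, 3, 5]), 1))
```

A VIF of 1 corresponds to r = 0; as the predictors become more correlated, the VIF grows without bound.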

In [None]:
# Test for heteroscedasticity (Breusch-Pagan).
# Store the result under its own name so the fitted model is not overwritten.
bp_test = sms.het_breuschpagan(model.resid, model.model.exog)

In [None]:
terms = ['LM stat', 'LM Test p-value', 'F-stat', 'F-test p-value']
print(dict(zip(terms, bp_test)))

The Lagrange multiplier statistic for the test is 38.016 and the corresponding p-value is less than 0.05. We therefore reject the null hypothesis (that homoscedasticity is present) and have sufficient evidence to say that heteroscedasticity is present in the regression model.
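
The idea behind the Lagrange multiplier statistic can be sketched by hand. The example below uses one common (studentised) form of the statistic, n × R² from an auxiliary regression of the squared residuals on the predictors; treating this as the form reported above is an assumption. The function `breusch_pagan_lm` and the data are hypothetical and for illustration only.

```python
import numpy as np

def breusch_pagan_lm(x, y):
    """LM statistic: n * R^2 from regressing squared OLS residuals on the predictors."""
    X = np.column_stack([np.ones(len(y)), x])               # add the constant
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)            # main regression
    resid_sq = (y - X @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(X, resid_sq, rcond=None)    # auxiliary regression
    ss_res = np.sum((resid_sq - X @ gamma) ** 2)
    ss_tot = np.sum((resid_sq - resid_sq.mean()) ** 2)
    return len(y) * (1 - ss_res / ss_tot)

# Hypothetical data: the error spread grows with x in the first case only.
x = np.arange(1, 51, dtype=float)
signs = np.where(np.arange(50) % 2 == 0, -1.0, 1.0)
y_hetero = 2 + 3 * x + signs * 0.5 * x   # spread proportional to x
y_homo = 2 + 3 * x + signs * 5.0         # constant spread

print(f"LM (spread grows with x): {breusch_pagan_lm(x, y_hetero):.2f}")
print(f"LM (constant spread):     {breusch_pagan_lm(x, y_homo):.2f}")
```

When the size of the residuals depends on the predictors, the auxiliary regression has a high R² and the statistic is large; with a constant spread it stays near zero.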

In [None]:
# Call the metrics.mean_absolute_error function.  
print('Mean Absolute Error (Final):', metrics.mean_absolute_error(Y_test, Y_pred))  

# Call the metrics.mean_squared_error function.
print('Mean Square Error (Final):', metrics.mean_squared_error(Y_test, Y_pred)) 

## 6. Observations and insights

After sense-checking and cleaning the given dataset, we looked for relationships between loyalty points and spending score, remuneration and age. Ordinary least squares (OLS) was used to estimate a linear regression model and fit a linear equation to the observed data. The results are below:

*spending vs loyalty: y_pred = -75.052663 + 33.061693 * reviews['spending_score']*

*remuneration vs loyalty: y_pred = -65.686513 + 34.187825 * reviews['remuneration']*

*age vs loyalty: y_pred = 1736.517739 - 4.012805 * reviews['age']*

Only the first two coefficients were significant according to the t-test. For the last model (age vs loyalty), where the coefficient is negative, the p-value is 0.058 > 0.05, so the slope is not statistically significant.

Spending vs loyalty has an R² of 45.2% and remuneration vs loyalty an R² of 38%. Neither is strong enough on its own to predict loyalty points well. To improve the model we can try a multiple linear regression (MLR) model that combines remuneration and spending score.

As a result we get a much higher R²: 82.8% of the variability in loyalty points is explained by the variability of remuneration and spending score together. All coefficients are statistically significant according to their p-values (0.00 < 0.05). The remuneration coefficient is 33.7573 (so if remuneration increases by 1 unit, with spending score held constant, the predicted loyalty points increase by 33.7573 units). The spending score coefficient is similar at 32.948.
The VIF is close to 1.0 for all independent variables, meaning there is no meaningful correlation between them. We also check for heteroscedasticity, which is a problem because ordinary least squares (OLS) regression assumes that the residuals come from a population with homoscedasticity, i.e. constant variance. In our model heteroscedasticity is present, so the standard errors and p-values should be interpreted with caution.


## How useful are remuneration and spending scores data? Defining customer groups.

# Clustering with *k*-means using Python

The marketing department also wants to better understand the usefulness of remuneration and spending scores but does not know where to begin. We are tasked with identifying groups within the customer base that can be used to target specific market segments. We use *k*-means clustering to identify the optimal number of clusters and then apply and plot the data using the created segments.

## Instructions
1. Prepare the data for clustering. 
    1. Import the CSV file you have prepared in Week 1.
    2. Create a new DataFrame (e.g. `df2`) containing the `remuneration` and `spending_score` columns.
    3. Explore the new DataFrame. 
2. Plot the remuneration versus spending score.
    1. Create a scatterplot.
    2. Create a pairplot.
3. Use the Silhouette and Elbow methods to determine the optimal number of clusters for *k*-means clustering.
    1. Plot both methods and explain how you determine the number of clusters to use.
    2. Add titles and legends to the plot.
4. Evaluate the usefulness of at least three values for *k* based on insights from the Elbow and Silhouette methods.
    1. Plot the predicted *k*-means.
    2. Explain which value might give you the best clustering.
5. Fit a final model using your selected value for *k*.
    1. Justify your selection and comment on the respective cluster sizes of your final solution.
    2. Check the number of observations per predicted class.
6. Plot the clusters and interpret the model.

## 1. Load and explore the data

In [None]:
# Import necessary libraries.
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import accuracy_score
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the CSV file(s) as df2.
df2 = pd.read_csv('reviews.csv')

# View the DataFrame.
df2.head()

In [None]:
# Drop unnecessary columns.
df2 = df2.drop(['review', 'summary'], axis = 1)

# View DataFrame.
df2.head()

In [None]:
# Explore the data.
df2.info()

In [None]:
# Descriptive statistics.
df2.describe()

## 2. Plot

In [None]:
# Create a scatterplot with Seaborn.
# Import Seaborn and Matplotlib.
from matplotlib import pyplot as plt
import seaborn as sns

# Create a scatterplot with Seaborn.
sns.scatterplot(x='remuneration',
                y='spending_score',
                data=df2)

In [None]:
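# Create a pairplot with Seaborn.
# Take an explicit copy so cluster labels can be added later without a SettingWithCopyWarning.
x = df2[['remuneration', 'spending_score']].copy()

sns.pairplot(df2,
             vars=['remuneration', 'spending_score'],
             diag_kind='kde')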

## 3. Elbow and silhouette methods

In [None]:
# Determine the number of clusters: Elbow method.
from sklearn.cluster import KMeans

# Elbow chart for us to decide on the number of optimal clusters.
ss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    max_iter=500,
                    n_init=10,
                    random_state=0)
    kmeans.fit(x)
    ss.append(kmeans.inertia_)

# Plot the elbow method.
plt.plot(range(1, 11),
         ss,
         marker='o')

# Insert labels and title.
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("SS distance")

plt.show()

The Elbow method is used to determine *k*: the sum of squared distances from each point to its assigned cluster centre is plotted on the y-axis against the number of clusters on the x-axis. Four to six clusters seem to fit.
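
The quantity plotted on the y-axis (`kmeans.inertia_`) can be reproduced by hand on a toy example. The points, labels and centres below are hypothetical; the sketch only shows why the sum of squares drops sharply once *k* matches the number of natural groups.

```python
def inertia(points, labels, centres):
    """Sum of squared Euclidean distances from each point to its cluster centre."""
    total = 0.0
    for (px, py), label in zip(points, labels):
        cx, cy = centres[label]
        total += (px - cx) ** 2 + (py - cy) ** 2
    return total

# Hypothetical (remuneration, spending_score) pairs forming two tight groups.
points = [(10, 20), (12, 22), (11, 18), (60, 80), (62, 78), (61, 82)]

# k = 1: every point belongs to one cluster centred on the overall mean.
centre_all = (sum(p[0] for p in points) / 6, sum(p[1] for p in points) / 6)
ss_k1 = inertia(points, [0] * 6, [centre_all])

# k = 2: each group gets its own centre (the group means).
centres_k2 = [(11, 20), (61, 80)]
ss_k2 = inertia(points, [0, 0, 0, 1, 1, 1], centres_k2)

print(f"k=1: {ss_k1:.1f}, k=2: {ss_k2:.1f}")
```

Adding further clusters beyond the natural number of groups can only shave a little more off an already small value, which is what produces the "elbow" in the chart.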

In [None]:
# Determine the number of clusters: Silhouette method.
# Import the silhouette_score function from sklearn.
from sklearn.metrics import silhouette_score

# Find the range of clusters to be used using silhouette method.
sil = []
kmax = 10

for k in range(2, kmax+1):
    kmeans_s = KMeans(n_clusters=k).fit(x)
    labels = kmeans_s.labels_
    sil.append(silhouette_score(x,
                                labels,
                                metric='euclidean'))

# Plot the silhouette method.
plt.plot(range(2, kmax+1),
         sil,
         marker='o')

# Insert labels and title.
plt.title("The Silhouette Method")
plt.xlabel("Number of clusters")
plt.ylabel("Sil")

plt.show()

The Silhouette method suggests 5-6 clusters as optimal.
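
For intuition, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below is a hand-rolled version on hypothetical points; it assumes every cluster has at least two points and is not a replacement for `silhouette_score`.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes every cluster has at least two points."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        # a: mean distance within the point's own cluster.
        a = sum(by_cluster[own]) / len(by_cluster[own])
        # b: smallest mean distance to any other cluster.
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(10, 20), (12, 22), (11, 18), (60, 80), (62, 78), (61, 82)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])   # labels follow the two groups
poor = silhouette(points, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
print(f"good split: {good:.3f}, poor split: {poor:.3f}")
```

Values near 1 indicate compact, well-separated clusters, which is why the peak of the silhouette chart is read as the preferred number of clusters.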

## 4. Evaluate k-means model at different values of *k*

## 4a) Use 4 clusters:

In [None]:
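# Use four clusters.
# Fit on the two feature columns only, so re-running the cell never uses earlier labels as a feature.
kmeans = KMeans(n_clusters = 4, 
                max_iter = 15000,
                init='k-means++',
                random_state=42).fit(x[['remuneration', 'spending_score']])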

clusters = kmeans.labels_

x['K-Means Predicted'] = clusters

# Plot the predicted.
sns.pairplot(x,
             hue='K-Means Predicted',
             diag_kind= 'kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts().sort_index()

In [None]:
# View the K-Means predicted.
print(x.head())

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

# Create a scatterplot.
sns.scatterplot(x='remuneration',
                y='spending_score',
                data=x,
                hue='K-Means Predicted',
                palette=['red', 'green', 'blue','black'])

## 4b) Use 5 clusters:

In [None]:
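# Use five clusters.
# Fit on the two feature columns only, so the labels added in the previous step are not used as a feature.
kmeans = KMeans(n_clusters = 5, 
                max_iter = 15000,
                init='k-means++',
                random_state=42).fit(x[['remuneration', 'spending_score']])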

clusters = kmeans.labels_

x['K-Means Predicted'] = clusters

# Plot the predicted.
sns.pairplot(x,
             hue='K-Means Predicted',
             diag_kind= 'kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts().sort_index()

In [None]:
# View the K-Means predicted.
print(x.head())

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

# Create a scatterplot.
sns.scatterplot(x='remuneration',
                y='spending_score',
                data=x,
                hue='K-Means Predicted',
                palette=['red', 'green', 'blue','black','purple'])

## 4c) Use 6 clusters:

In [None]:
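# Use six clusters.
# Fit on the two feature columns only, so the labels added in the previous step are not used as a feature.
kmeans = KMeans(n_clusters = 6, 
                max_iter = 15000,
                init='k-means++',
                random_state=42).fit(x[['remuneration', 'spending_score']])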

clusters = kmeans.labels_

x['K-Means Predicted'] = clusters

# Plot the predicted.
sns.pairplot(x,
             hue='K-Means Predicted',
             diag_kind= 'kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts().sort_index()

In [None]:
# View the K-Means predicted.
print(x.head())

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

# Create a scatterplot.
sns.scatterplot(x='remuneration',
                y='spending_score',
                data=x,
                hue='K-Means Predicted',
                palette=['red', 'green', 'blue','black','purple','pink'])

## 5. Discuss: Insights and observations


It seems that k=5 (five clusters) gives the best groups for targeting specific market segments based on remuneration and spending score. Cluster 0 is the largest group, as many users have remuneration and spending scores close to the mean. The number of observations per predicted class indicates a better distribution for k=5 than for k=4 or k=6 (comparing the predicted classes with the pairplot of the data). The scatterplot coloured by predicted cluster membership shows visually how the clusters separate by remuneration and spending score.
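
One way to turn cluster labels into segment descriptions is to summarise the size and average profile of each cluster. The sketch below uses a small hypothetical DataFrame with a `cluster` column standing in for `K-Means Predicted`; the values are made up and do not reflect the actual clusters.

```python
import pandas as pd

# Hypothetical customers with already-assigned cluster labels.
customers = pd.DataFrame({
    "remuneration":   [12, 14, 45, 48, 50, 80, 85],
    "spending_score": [80, 75, 50, 52, 48, 15, 20],
    "cluster":        [0, 0, 1, 1, 1, 2, 2],
})

# Size and average profile of each cluster.
profile = customers.groupby("cluster").agg(
    size=("remuneration", "size"),
    mean_remuneration=("remuneration", "mean"),
    mean_spending_score=("spending_score", "mean"),
)
print(profile)
```

A table like this makes it easier to name the segments (for example, low remuneration with a high spending score) before handing them to marketing.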

## Use of social data (customer reviews and summaries) for marketing campaigns.

# NLP using Python
Customer reviews were downloaded from the website of Turtle Games. This data will be used to steer the marketing department on how to approach future campaigns. Therefore, the marketing department asked you to identify the 15 most common words used in online product reviews. They also want to have a list of the top 20 positive and negative reviews received from the website. Therefore, you need to apply NLP on the data set.

## Instructions
1. Load and explore the data. 
    1. Sense-check the DataFrame.
    2. You only need to retain the `review` and `summary` columns.
    3. Determine if there are any missing values.
2. Prepare the data for NLP
    1. Change to lower case and join the elements in each of the columns respectively (`review` and `summary`).
    2. Replace punctuation in each of the columns respectively (`review` and `summary`).
    3. Drop duplicates in both columns (`review` and `summary`).
3. Tokenise and create wordclouds for the respective columns (separately).
    1. Create a copy of the DataFrame.
    2. Apply tokenisation on both columns.
    3. Create and plot a wordcloud image.
4. Frequency distribution and polarity.
    1. Create frequency distribution.
    2. Remove alphanumeric characters and stopwords.
    3. Create wordcloud without stopwords.
    4. Identify 15 most common words and polarity.
5. Review polarity and sentiment.
    1. Plot histograms of polarity (use 15 bins) for both columns.
    2. Review the sentiment scores for the respective columns.
6. Identify and print the top 20 positive and negative reviews and summaries respectively.
7. Include your insights and observations.

## 1. Load and explore the data

In [None]:
# Import all the necessary packages.
import nltk 
import os 

nltk.download ('punkt')
nltk.download ('stopwords')

from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from textblob import TextBlob
from scipy.stats import norm

# Import Counter.
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the data set as dataframe df3.
df3 = pd.read_csv('reviews.csv')

# View the DataFrame.
df3.head()

In [None]:
# Explore data set. Define size and datatypes.
df3.info()

The initial data set has 2000 observations and 9 columns.

In [None]:
# Keep necessary columns (only reviews and summary). Drop unnecessary columns.
df3 = df3.drop(['gender','age','remuneration','spending_score','loyalty_points','education','product'], axis=1)

# View DataFrame.
df3

In [None]:
# Determine if there are any missing values in the dataframe.
df3.isnull().sum()

## 2. Prepare the data for NLP
### 2a) Change to lower case and join the elements in each of the columns respectively (review and summary)

In [None]:
# Review column: Change all to lower case and join with a space.
df3['review'] = df3['review'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# View the result.
df3['review'].head()

In [None]:
# Summary column: Change all to lower case and join with a space.
df3['summary'] = df3['summary'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# View the result.
df3['summary'].head()

### 2b) Replace punctuation in each of the columns respectively (review and summary)

In [None]:
# Remove all punctuation from the review column (regex=True is required in recent pandas versions).
df3['review'] = df3['review'].str.replace(r'[^\w\s]', '', regex=True)

# View output.
df3['review'].head()

In [None]:
# Remove all punctuation from the summary column (regex=True is required in recent pandas versions).
df3['summary'] = df3['summary'].str.replace(r'[^\w\s]', '', regex=True)

# View output.
df3['summary'].head()

### 2c) Drop duplicates in both columns

In [None]:
# Look for duplicates in review column.
df3.review.duplicated().sum()

# Drop duplicates in review column.
df3 = df3.drop_duplicates(subset=['review'])

# View DataFrame.
df3.head()

In [None]:
# Look for duplicates in summary columns.
df3.summary.duplicated().sum()

# Drop duplicates in summary column.
df3 = df3.drop_duplicates(subset=['summary'])

# View DataFrame.
df3.reset_index(inplace=True)
df3.head()

In [None]:
df3.info()

After all cleaning, 1349 records are left.
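
The cleaning steps in this section (lower-casing, stripping punctuation and dropping duplicates) can be sketched on a toy list with the standard library alone. The reviews and the helper `clean` are hypothetical; the sketch mirrors the logic rather than the pandas calls.

```python
import re

# Hypothetical raw reviews (not taken from the Turtle Games data).
raw_reviews = ["Great game!!", "great   GAME", "Fun, but pricey.", "Fun but pricey"]

def clean(text: str) -> str:
    """Lower-case, collapse whitespace and strip punctuation."""
    text = " ".join(text.lower().split())
    return re.sub(r"[^\w\s]", "", text)

cleaned = [clean(r) for r in raw_reviews]

# Drop duplicates while keeping the first occurrence of each review.
unique = list(dict.fromkeys(cleaned))
print(unique)
```

Normalising case and punctuation first matters: reviews that differ only in formatting become identical and are then removed as duplicates.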

## 3. Tokenise and create wordclouds

In [None]:
# Create new DataFrame (an independent copy, not just a second name for df3).
df4 = df3.copy()

# View DataFrame.
df4.head()

In [None]:
# Apply tokenisation to both columns (review and summary) and assign them to new columns.
df4['token_review'] = df4['review'].apply(word_tokenize)
df4['token_summary'] = df4['summary'].apply(word_tokenize)

# View DataFrame.
df4.head()

In [None]:
# Review: Create a word cloud.
# Join all reviews into one string, separated by spaces so that the
# last word of one review does not merge with the first word of the next.
all_review = ' '.join(df4['review'])

In [None]:
# Review: Plot the WordCloud image.
# Set the colour palette.
sns.set(color_codes=True)

# Create a WordCloud object (including stopwords).
word_cloud = WordCloud(width = 1600, height = 900, 
                background_color ='black',
                colormap = 'plasma', 
                stopwords = set(),
                min_font_size = 10).generate(all_review) 

# Plot the WordCloud image.                    
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(word_cloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Summary: Create a word cloud.
# Join all summaries into one string, separated by spaces so that the
# last word of one summary does not merge with the first word of the next.
all_summary = ' '.join(df4['summary'])

In [None]:
# Summary: Plot the WordCloud image.
# Set the colour palette.
sns.set(color_codes=True)

# Create a WordCloud object (including stopwords).
word_cloud = WordCloud(width = 1600, height = 900, 
                background_color ='white',
                colormap = 'plasma', 
                stopwords = set(),
                min_font_size = 10).generate(all_summary) 

# Plot the WordCloud image.                    
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(word_cloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

## 4. Frequency distribution and polarity
### 4a) Create frequency distribution

In [None]:
# Create a list for tokenised review column.
# Define an empty list of tokens review.
all_tokens_review = []
for i in range(df4.shape[0]):
    # Add each token to the list.
    all_tokens_review = all_tokens_review + df4['token_review'][i]

# Calculate the frequency distribution for review column.
freq_dist_of_words = FreqDist(all_tokens_review)

# Determine 5 most common words used in reviews.
freq_dist_of_words.most_common(5)

In [None]:
# Create a list for tokenised summary column.
# Define an empty list of tokens summary.
all_tokens_summary = []
for i in range(df4.shape[0]):
    # Add each token to the list.
    all_tokens_summary = all_tokens_summary + df4['token_summary'][i]

# Calculate the frequency distribution for summary column.
freq_dist_of_words = FreqDist(all_tokens_summary)

# Determine 5 most common words used in summary.
freq_dist_of_words.most_common(5)

### 4b) Remove alphanumeric characters and stopwords
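
A minimal sketch of this filtering step on toy tokens is shown below. The tokens and the three-word `stop_words` set are hypothetical stand-ins (the NLTK English stop word list is far longer); the cells that follow apply the same idea to the real tokens.

```python
from collections import Counter

# Hypothetical tokens and a tiny stand-in stop word set.
tokens = ["the", "game", "is", "fun", "!", "fun", "game", "for", "kids", "5"]
stop_words = {"the", "is", "for"}

# Keep alphanumeric tokens only, then drop stop words.
alnum_tokens = [t for t in tokens if t.isalnum()]
filtered = [t for t in alnum_tokens if t.lower() not in stop_words]

# Count the remaining words.
print(Counter(filtered).most_common(2))
```

Note that `isalnum()` keeps purely numeric tokens such as `"5"`; whether those should count as words is a separate decision.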

In [None]:
# Keep only alphanumeric tokens (drop any remaining non-alphanumeric tokens).
token_review = [word for word in all_tokens_review if word.isalnum()]
token_summary = [word for word in all_tokens_summary if word.isalnum()]

In [None]:
# Remove all the stopwords for reviews.
# Create a set of English stop words.
english_stopwords = set(stopwords.words('english'))

# Create a filtered list of token_review without stop words.
token_review2 = [x for x in token_review if x.lower() not in english_stopwords]

# Define an empty string variable for reviews.
token_review2_string = ''
for value in token_review2:
    # Add each filtered token word to the string.
    token_review2_string = token_review2_string + value + ' '

In [None]:
# Remove all the stopwords for summary.

# Create a filtered list of token_summary without stop words.
token_summary2 = [x for x in token_summary if x.lower() not in english_stopwords]

# Define an empty string variable.
token_summary2_string = ''
for value in token_summary2:
    # Add each filtered token word to the string.
    token_summary2_string = token_summary2_string + value + ' '

### 4c) Create wordcloud without stopwords

In [None]:
# Create a wordcloud object without stopwords for reviews.
wordcloud = WordCloud(width = 1600, height = 900, 
                background_color ='black', 
                colormap='plasma', 
                min_font_size = 10).generate(token_review2_string) 

# Plot the WordCloud image.                        
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Create a wordcloud without stopwords for summary.
wordcloud = WordCloud(width = 1600, height = 900, 
                background_color ='white', 
                colormap='plasma', 
                min_font_size = 10).generate(token_summary2_string) 

# Plot the WordCloud image.                        
plt.figure(figsize = (16, 9), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis('off') 
plt.tight_layout(pad = 0) 
plt.show()

### 4d) Identify 15 most common words and polarity

In [None]:
# Determine the 15 most common words for reviews.
# View the frequency distribution for reviews.
fdist1 = FreqDist(token_review2)

# Preview the data.
fdist1

In [None]:
# Present the data for frequency of words used in reviews in more readable format.

# Import the Counter class.
from collections import Counter

# Generate a DataFrame from Counter for reviews.
count_review = pd.DataFrame(Counter(token_review2).most_common(15), 
                      columns=['Word', 'Frequency']).set_index('Word')

# Preview data.
count_review

In [None]:
# Visualise the most frequent words used in review.
# Set the plot type.
ax = count_review.plot(kind='barh', figsize=(16, 9), fontsize=12,
                 colormap ='plasma')

# Set the labels.
ax.set_xlabel('Count', fontsize=12)
ax.set_ylabel('Word', fontsize=12)
ax.set_title("Reviews: Count of the 15 most frequent words",
             fontsize=20)

# Draw the bar labels.
for i in ax.patches:
    ax.text(i.get_width()+.41, i.get_y()+.1, str(round((i.get_width()), 2)),
            fontsize=12, color='red')

In [None]:
# Generate a DataFrame from Counter for summaries.
count_summary = pd.DataFrame(Counter(token_summary2).most_common(15), 
                      columns=['Word', 'Frequency']).set_index('Word')

# Preview data.
count_summary

In [None]:
# Visualise the most frequent words used in summary.
# Set the plot type.
ax = count_summary.plot(kind='barh', figsize=(16, 9), fontsize=12,
                 colormap ='plasma')

# Set the labels.
ax.set_xlabel('Count', fontsize=12)
ax.set_ylabel('Word', fontsize=12)
ax.set_title("Summary: Count of the 15 most frequent words",
             fontsize=20)

# Draw the bar labels.
for i in ax.patches:
    ax.text(i.get_width()+.41, i.get_y()+.1, str(round((i.get_width()), 2)),
            fontsize=12, color='red')

## 5. Review polarity and sentiment: Plot histograms of polarity (use 15 bins) and sentiment scores for the respective columns.

In [None]:
# Install TextBlob.
!pip install textblob

# Import the necessary package.
from textblob import TextBlob

In [None]:
# Provided function.
def generate_polarity(comment):
    '''Extract polarity score (-1 to +1) for each comment'''
    return TextBlob(comment).sentiment[0]

## Review column analysis.

In [None]:
# Determine polarity for review column. 

# Populate a new column with polarity scores for each comment.
df4['polarity_review'] = df4['review'].apply(generate_polarity)

# Preview the result.
df4['polarity_review'].head()

The `generate_polarity` function returns the first element of TextBlob's `sentiment` property for each review: the polarity score, which ranges from -1 (most negative) to +1 (most positive).

In [None]:
# Define a function to extract a subjectivity score for the review.
def generate_subjectivity(comment):
    return TextBlob(comment).sentiment[1]

# Populate a new column with subjectivity scores for each review.
df4['subjectivity_review'] = df4['review'].apply(generate_subjectivity)

# Preview the result.
df4['subjectivity_review'].head()

The `generate_subjectivity` function returns the second element of TextBlob's `sentiment` property for each review: the subjectivity score, which ranges from 0 (objective) to 1 (subjective).
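
TextBlob's `sentiment` property returns a named tuple with two fields, `polarity` and `subjectivity`, which is why index `[0]` and index `[1]` work in the functions above. The sketch below mimics that shape with a stand-in `Sentiment` tuple and made-up scores, so it runs without TextBlob installed; the values are illustrative only.

```python
from collections import namedtuple

# Stand-in with the same field order as TextBlob's sentiment result.
Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])

# Made-up scores for illustration; not computed from any real review.
example = Sentiment(polarity=0.5, subjectivity=0.6)

# Index access and attribute access return the same values.
polarity_by_index = example[0]
polarity_by_name = example.polarity
subjectivity_by_index = example[1]
subjectivity_by_name = example.subjectivity

print(polarity_by_index == polarity_by_name)
print(subjectivity_by_index == subjectivity_by_name)
```

With TextBlob itself, `TextBlob(comment).sentiment.polarity` is equivalent to `TextBlob(comment).sentiment[0]` and may read more clearly.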

In [None]:
# Visualise sentiment polarity scores for reviews.
# Review: Create a histogram plot with bins = 15.
# Set the number of bins.
num_bins = 15

# Set the plot area.
plt.figure(figsize=(16,9))

# Define the bars.
n, bins, patches = plt.hist(df4['polarity_review'], num_bins, facecolor='blue', alpha=0.8)

# Set the labels.
plt.xlabel('Polarity', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Review: Histogram of sentiment score polarity', fontsize=20)

plt.show()

## Summary column analysis.

In [None]:
# Determine polarity for summary column.

# Populate a new column with polarity scores for each summary.
df4['polarity_summary'] = df4['summary'].apply(generate_polarity)

# Preview the result.
df4['polarity_summary'].head()

In [None]:
# Reuse generate_subjectivity(), already defined in the review column analysis.

# Populate a new column with subjectivity scores for each summary.
df4['subjectivity_summary'] = df4['summary'].apply(generate_subjectivity)

# Preview the result.
df4['subjectivity_summary'].head()

In [None]:
# Visualise sentiment polarity scores for summaries.
# Summary: Create a histogram plot with bins = 15.
# Set the number of bins.
num_bins = 15

# Set the plot area.
plt.figure(figsize=(16,9))

# Define the bars.
n, bins, patches = plt.hist(df4['polarity_summary'], num_bins, facecolor='green', alpha=0.8)

# Set the labels.
plt.xlabel('Polarity', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Summary: Histogram of sentiment score polarity', fontsize=20)

plt.show()

## 6. Identify top 20 positive and negative reviews and summaries respectively

In [None]:
# Top 20 negative reviews.
# Create a DataFrame for 20 negative reviews.
negative_sentiment = df4.nsmallest(20, 'polarity_review')

# Eliminate unnecessary columns.
negative_sentiment = negative_sentiment[['review', 'polarity_review', 'subjectivity_review']]

# View output.
negative_sentiment.style.set_properties(subset=['review'])

In [None]:
# View the full text of the review with index label 141.
negative_sentiment.at[141, 'review']

In [None]:
# Top 20 negative summaries.
# Create a DataFrame for 20 negative summaries.
negative_sentiment = df4.nsmallest(20, 'polarity_summary')

# Eliminate unnecessary columns.
negative_sentiment = negative_sentiment[['summary', 'polarity_summary', 'subjectivity_summary']]

# View output.
negative_sentiment.style.set_properties(subset=['summary'])

In [None]:
# Top 20 positive reviews.
# Create a DataFrame for 20 most positive reviews.
positive_sentiment = df4.nlargest(20, 'polarity_review')

# Eliminate unnecessary columns.
positive_sentiment = positive_sentiment[['review', 'polarity_review', 'subjectivity_review']]

# View the output.
positive_sentiment.style.set_properties(subset=['review'])

In [None]:
# Top 20 positive summaries.
# Create a DataFrame for 20 most positive summaries.
positive_sentiment = df4.nlargest(20, 'polarity_summary')

# Eliminate unnecessary columns.
positive_sentiment = positive_sentiment[['summary', 'polarity_summary', 'subjectivity_summary']]

# View output.
positive_sentiment.style.set_properties(subset=['summary'])

## 7. Discuss: Insights and observations

The 15 most frequent words in both reviews and summaries are mostly positive, with many mentions of "game", "fun", "play" and "love".

Reviews: The histogram shows that most reviews have a polarity above zero, so most users express positive sentiment in their reviews.

Summaries: The histogram shows that most summaries also sit in the positive part of the plot, where users express particularly strong positive sentiment.
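
The visual reading of the histograms can be backed up with a simple count. The sketch below is one possible way to do that, assuming polarity scores are available as a list or a pandas Series; the `polarity_shares` helper and the sample scores are hypothetical and not part of the original notebook.

```python
def polarity_shares(scores):
    """Return the fraction of negative, neutral and positive scores."""
    scores = list(scores)
    total = len(scores)
    if total == 0:
        return {'negative': 0.0, 'neutral': 0.0, 'positive': 0.0}
    negative = sum(1 for s in scores if s < 0)
    positive = sum(1 for s in scores if s > 0)
    neutral = total - negative - positive
    return {'negative': negative / total,
            'neutral': neutral / total,
            'positive': positive / total}


# Made-up polarity scores for illustration only.
sample_scores = [0.5, 0.2, 0.0, -0.1, 0.8]
shares = polarity_shares(sample_scores)
print(shares)
```

In the notebook this could be called as `polarity_shares(df4['polarity_review'])` and `polarity_shares(df4['polarity_summary'])` to compare the two columns.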

Top 20 positive and negative reviews and summaries: After checking how the sentiment analysis scored these texts, we conclude that a human recheck is needed, especially for the negative ones. Complaints about missing parts and product quality should be addressed. The model sometimes misreads a game description (especially one with aggressive or violent content) as the reviewer's opinion of the product, and it does not pick up sarcasm or jokes.
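
One way to shortlist rows for the human recheck is a simple rule on the two polarity scores. The sketch below is only an illustration: the `needs_manual_check` helper, the rule itself and the `threshold` of 0.5 are assumptions, not something the original analysis used.

```python
def needs_manual_check(polarity_review, polarity_summary, threshold=0.5):
    """Flag a row whose review and summary polarity point in opposite
    directions, or whose review polarity is strongly negative."""
    opposite_signs = polarity_review * polarity_summary < 0
    strongly_negative = polarity_review <= -threshold
    return opposite_signs or strongly_negative


# Made-up (review, summary) polarity pairs for illustration only.
sample_pairs = [(0.6, 0.8), (-0.7, -0.2), (0.4, -0.5), (0.1, 0.0)]
flags = [needs_manual_check(r, s) for r, s in sample_pairs]
print(flags)
```

Assuming the polarity columns exist in `df4`, the rule could be applied row by row with `df4.apply(lambda row: needs_manual_check(row['polarity_review'], row['polarity_summary']), axis=1)`. A rule like this only narrows down what a person reads first; it does not detect sarcasm or jokes.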