## Analysis of an E-commerce Dataset Part 2

Name = MD AZIZUL HOQUE ID : 46769579

The goal of the second analysis task is to train linear regression models to predict users' ratings towards items. This involves a standard Data Science workflow: exploring data, building models, making predictions, and evaluating results. In this task, we will explore the impact of feature selection and of different training/testing set sizes on model performance. We will use another cleaned combined e-commerce sub-dataset that **is different from** the one in “Analysis of an E-commerce Dataset” task 1.

### Import Cleaned E-commerce Dataset
The csv file named 'cleaned_ecommerce_dataset.csv' is provided. You may need to use the Pandas method, i.e., `read_csv`, for reading it. After that, please print out its total length.

In [1]:
#import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.preprocessing import OrdinalEncoder

In [2]:
#read data from csv
ecommerce = pd.read_csv("cleaned_ecommerce_dataset.csv")
ecommerce

Unnamed: 0,userId,timestamp,review,item,rating,helpfulness,gender,category,item_id,item_price,user_city
0,4081,71900,Not always McCrap,McDonald's,4.0,3.0,M,Restaurants & Gourmet,41,30.74,4
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,1.0,4.0,M,Restaurants & Gourmet,74,108.30,4
2,4081,72000,The Wonderful World of Wendy,Wendy's,5.0,4.0,M,Restaurants & Gourmet,84,69.00,4
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",5.0,3.0,M,Movies,68,143.11,4
4,4081,100399,Hey! Gimme some pie!,American Pie,3.0,3.0,M,Movies,6,117.89,4
...,...,...,...,...,...,...,...,...,...,...,...
2680,2445,22000,Great movie!,Austin Powers: The Spy Who Shagged Me,5.0,3.0,M,Movies,9,111.00,5
2681,2445,30700,Good food!,Outback Steakhouse,5.0,3.0,M,Restaurants & Gourmet,50,25.00,5
2682,2445,61500,Great movie!,Fight Club,5.0,3.0,M,Movies,26,97.53,5
2683,2445,100500,Awesome Game.,The Sims 2: Open for Business for Windows,5.0,4.0,M,Games,79,27.00,5


In [3]:
#print total length
ecommerce.shape[0]

2685

### Explore the Dataset

* Use the methods, i.e., `head()` and `info()`, to get a rough picture of the data, e.g., how many columns there are, and the data type of each column.
* As our goal is to predict ratings given other columns, please get the correlations between helpfulness/gender/category/review and rating by using the `corr()` method.
* To get the correlations between different features, you may need to first convert the categorical features (i.e., gender, category and review) into numerical values. To do this, you may need to import `OrdinalEncoder` from `sklearn.preprocessing` (refer to the useful examples [here](https://pbpython.com/categorical-encoding.html)).
* Please provide ___necessary explanations/analysis___ on the correlations, and figure out which are the ___most___ and ___least___ correlated features regarding rating. Try to ___discuss___ how the correlation will affect the final prediction results, if we use these features to train a regression model for rating prediction. In what follows, we will conduct experiments to verify your hypothesis.
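
As a small illustration of what `OrdinalEncoder` does, the sketch below encodes a made-up column of three category labels (the `toy` values are assumptions for illustration, not rows from the dataset). It shows that the encoder sorts the labels and assigns consecutive codes, so the numeric order follows alphabetical order rather than any real ranking.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column with three nominal labels (made-up values for illustration).
toy = np.array([["Movies"], ["Media"], ["Games"], ["Movies"]])

enc = OrdinalEncoder()
codes = enc.fit_transform(toy)

# The fitted categories are sorted, then mapped to 0, 1, 2, ...
print(enc.categories_[0])  # sorted labels: Games, Media, Movies
print(codes.ravel())       # one float code per row: 2, 1, 0, 2
```

This is worth keeping in mind when reading correlations computed on encoded nominal features: the sign and size of the coefficient depend partly on an arbitrary label order.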

In [4]:
#using head() method
ecommerce.head(5)

Unnamed: 0,userId,timestamp,review,item,rating,helpfulness,gender,category,item_id,item_price,user_city
0,4081,71900,Not always McCrap,McDonald's,4.0,3.0,M,Restaurants & Gourmet,41,30.74,4
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,1.0,4.0,M,Restaurants & Gourmet,74,108.3,4
2,4081,72000,The Wonderful World of Wendy,Wendy's,5.0,4.0,M,Restaurants & Gourmet,84,69.0,4
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",5.0,3.0,M,Movies,68,143.11,4
4,4081,100399,Hey! Gimme some pie!,American Pie,3.0,3.0,M,Movies,6,117.89,4


In [5]:
#using info() method
ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2685 entries, 0 to 2684
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2685 non-null   int64  
 1   timestamp    2685 non-null   int64  
 2   review       2685 non-null   object 
 3   item         2685 non-null   object 
 4   rating       2685 non-null   float64
 5   helpfulness  2685 non-null   float64
 6   gender       2685 non-null   object 
 7   category     2685 non-null   object 
 8   item_id      2685 non-null   int64  
 9   item_price   2685 non-null   float64
 10  user_city    2685 non-null   int64  
dtypes: float64(3), int64(4), object(4)
memory usage: 230.9+ KB


In [6]:
ecommerce.shape[0]

2685

In [7]:
ecommerce.shape[1]

11

In [8]:
# data type of each column
ecommerce.dtypes

userId           int64
timestamp        int64
review          object
item            object
rating         float64
helpfulness    float64
gender          object
category        object
item_id          int64
item_price     float64
user_city        int64
dtype: object

In [9]:
# display list of the columns
ecommerce.columns

Index(['userId', 'timestamp', 'review', 'item', 'rating', 'helpfulness',
       'gender', 'category', 'item_id', 'item_price', 'user_city'],
      dtype='object')

In [10]:
#converting categorical features into numerical values
ord_enc = OrdinalEncoder()
ecommerce["Gender"] = ord_enc.fit_transform(ecommerce[["gender"]])
ecommerce["Category"] = ord_enc.fit_transform(ecommerce[["category"]])
ecommerce["Review"] = ord_enc.fit_transform(ecommerce[["review"]])

In [11]:
ecommerce[["gender", "Gender"]].head(10)

Unnamed: 0,gender,Gender
0,M,1.0
1,M,1.0
2,M,1.0
3,M,1.0
4,M,1.0
5,M,1.0
6,M,1.0
7,M,1.0
8,M,1.0
9,M,1.0


In [12]:
ecommerce[["category", "Category"]].head(10)

Unnamed: 0,category,Category
0,Restaurants & Gourmet,8.0
1,Restaurants & Gourmet,8.0
2,Restaurants & Gourmet,8.0
3,Movies,5.0
4,Movies,5.0
5,Movies,5.0
6,Movies,5.0
7,Media,4.0
8,Movies,5.0
9,Restaurants & Gourmet,8.0


In [13]:
ecommerce[["review", "Review"]].head(10)

Unnamed: 0,review,Review
0,Not always McCrap,1618.0
1,I dropped the chalupa even before he told me to,1125.0
2,The Wonderful World of Wendy,2185.0
3,They actually did it,2243.0
4,Hey! Gimme some pie!,1033.0
5,Good for sci-fi,925.0
6,Scary? you bet!,1854.0
7,Fox - the 4th basic channel,795.0
8,Amen!,262.0
9,mama mia!,2643.0


In [14]:
#Finding Correlation
selected_ecommerce = ['helpfulness', 'Category', 'Review', 'rating', 'Gender']
new_dataframe = ecommerce[selected_ecommerce]
corr_matrix = new_dataframe.corr()
print(corr_matrix)

             helpfulness  Category    Review    rating    Gender
helpfulness     1.000000 -0.013408 -0.028259 -0.007523  0.075947
Category       -0.013408  1.000000  0.001970 -0.163158  0.022549
Review         -0.028259  0.001970  1.000000 -0.036118 -0.037884
rating         -0.007523 -0.163158 -0.036118  1.000000 -0.034337
Gender          0.075947  0.022549 -0.037884 -0.034337  1.000000


## Analysis on the correlations
Correlation measures the strength and direction of the linear relationship between two variables. The Pearson coefficient returned by `corr()` ranges from -1 to 1.
* **Positive correlation**: a coefficient above 0 means the two variables tend to move in the same direction; +1 is a perfect increasing linear relationship.
* **Negative correlation**: a coefficient below 0 means the two variables tend to move in opposite directions; -1 is a perfect decreasing linear relationship.
* **Zero correlation**: a coefficient of 0 means there is no linear relationship between the two variables; they may still be related in a non-linear way.
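
To make the three cases concrete, here is a minimal sketch that computes the Pearson coefficient by hand on made-up arrays. The helper `pearson_r` and the toy data are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.array([1, 2, 3, 4, 5])
r_pos = pearson_r(x, 2 * x + 1)      # increasing straight line
r_neg = pearson_r(x, -3 * x + 10)    # decreasing straight line
r_zero = pearson_r(x, (x - 3) ** 2)  # symmetric U-shape, no linear trend

print(r_pos, r_neg, r_zero)
```

The last case depends entirely on `x` yet has a coefficient of zero, which is why a coefficient near zero only rules out a linear relationship.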

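From the table above, all four features are negatively and only weakly correlated with rating. Ranked by absolute value, category (-0.163158) is the most correlated, followed by review (-0.036118), gender (-0.034337), and helpfulness (-0.007523). The two most correlated features are therefore category and review, and the two least correlated are gender and helpfulness. The category and review codes come from an ordinal encoding of unordered labels, so their coefficients should be read with caution. Because even the strongest coefficient is small, I expect a linear model trained on category and review to perform only slightly better than one trained on helpfulness and gender, and neither to predict rating accurately.

Building regression models on different combinations of these features and comparing their Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²) on the held-out test set can be used to test this hypothesis.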
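
As a quick reference for how these three metrics behave, the sketch below evaluates three made-up prediction vectors against the same toy targets (all arrays are assumptions for illustration). It includes a case where R-squared is negative, which happens when predictions are worse than always predicting the mean of the targets.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

predictions = {
    "good": np.array([1.0, 2.0, 3.0, 4.0, 4.0]),       # one small miss
    "mean": np.full_like(y_true, y_true.mean()),       # always predict 3.0
    "reversed": np.array([5.0, 4.0, 3.0, 2.0, 1.0]),   # wrong direction
}

scores = {}
for name, y_pred in predictions.items():
    mse = mean_squared_error(y_true, y_pred)
    scores[name] = (mse, np.sqrt(mse), r2_score(y_true, y_pred))
    print(name, scores[name])
```

MSE and RMSE are never negative and smaller is better, while R-squared equals 0 for the mean predictor and drops below 0 for anything worse than that.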

### Split Training and Testing Data
* Machine learning models are trained to help make predictions for the future. Normally, we need to randomly split the dataset into training and testing sets, where we use the training set to train the model, and then leverage the well-trained model to make predictions on the testing set.
* To further investigate whether the size of the training/testing data affects the model performance, please randomly split the data into training and testing sets with different sizes:
    * Case 1: training data containing 10% of the entire data;
    * Case 2: training data containing 90% of the entire data.
* Print the shape of training and testing sets in the two cases.
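
The two cases can be sketched on a stand-in array with the same number of rows as the dataset (the `rows` array and the `toy_*` names are assumptions for illustration). The point of the example is the rounding: 10% and 90% of 2685 are not whole numbers, so the two splits are not exact mirror images of each other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(2685)  # stand-in for the 2685 rows of the dataset

# Case 1: keep 10% for training by sending 90% to the test set.
toy_train_1, toy_test_1 = train_test_split(rows, test_size=0.9, random_state=142)
# Case 2: keep 90% for training by sending 10% to the test set.
toy_train_2, toy_test_2 = train_test_split(rows, test_size=0.1, random_state=142)

print(len(toy_train_1), len(toy_test_1))
print(len(toy_train_2), len(toy_test_2))
```

With a fractional `test_size`, scikit-learn rounds the test set size up and gives the remaining rows to the training set.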

In [15]:
#split training and testing data
train_c1, test_c1 = train_test_split(ecommerce, test_size=0.9, random_state=142)
train_c2, test_c2 = train_test_split(ecommerce, test_size=0.1, random_state=142)

In [16]:
#printing training and test data shape
#case 1
print(train_c1.shape)
print(test_c1.shape)
#case 2
print(train_c2.shape)
print(test_c2.shape)

(268, 14)
(2417, 14)
(2416, 14)
(269, 14)


### Train Linear Regression Models with Feature Selection under Cases 1 & 2
* When training a machine learning model for prediction, we may need to select the most important/correlated input features for more accurate results.
* To investigate whether feature selection affects the model performance, please select two most correlated features and two least correlated features regarding rating, respectively.
* Train four linear regression models by following the conditions:
    - (model-a) using the training/testing data in case 1 with two most correlated input features
    - (model-b) using the training/testing data in case 1 with two least correlated input features
    - (model-c) using the training/testing data in case 2 with two most correlated input features
    - (model-d) using the training/testing data in case 2 with two least correlated input features
* By doing this, we can verify the impact of the size of the training/testing data on the model performance by comparing model-a and model-c (or model-b and model-d); meanwhile, the impact of feature selection can be validated by comparing model-a and model-b (or model-c and model-d).
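
One possible way to organise the four experiments is a small helper that fits and scores a model for a given split and feature list. The sketch below runs it on synthetic data; the helper `fit_and_score`, the column names `strong` and `weak`, and the generated values are all assumptions for illustration and do not come from the e-commerce dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_and_score(train, test, features, target="rating"):
    """Fit a linear regression on `train`; return (MSE, RMSE) on `test`."""
    model = LinearRegression().fit(train[features], train[target])
    errors = test[target].to_numpy() - model.predict(test[features])
    mse = float(np.mean(errors ** 2))
    return mse, float(np.sqrt(mse))

# Synthetic stand-in: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({"strong": rng.normal(size=n), "weak": rng.normal(size=n)})
demo["rating"] = 3.0 + 0.8 * demo["strong"] + rng.normal(size=n)

splits = {
    "10% train": train_test_split(demo, test_size=0.9, random_state=0),
    "90% train": train_test_split(demo, test_size=0.1, random_state=0),
}
feature_sets = {"strong": ["strong"], "weak": ["weak"]}

# One entry per (split, feature set) pair, mirroring model-a to model-d.
results = {
    (split_name, feat_name): fit_and_score(tr, te, feats)
    for split_name, (tr, te) in splits.items()
    for feat_name, feats in feature_sets.items()
}
for key, (mse, rmse) in results.items():
    print(key, round(mse, 3), round(rmse, 3))
```

Keeping each fitted model inside the helper avoids reusing one variable name for several models, which makes it harder to mix up predictions from different experiments.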

## Two most correlated features are category and review and two least correlated features are helpfulness and gender

## Model-a

In [17]:
# Define input variable
X_train = train_c1[['Category','Review']]
y_train = train_c1['rating']

X_test = test_c1[['Category', 'Review']]
Ma_y_test = test_c1['rating']

In [18]:
# Build Model
reg = linear_model.LinearRegression()
# Train the model
reg.fit(X_train, y_train)

In [19]:
#predict for test data
model_a_predicted = reg.predict(X_test)

## Model-b

In [20]:
# Define input variable
X_train = train_c1[['helpfulness','Gender']]
y_train = train_c1['rating']

X_test = test_c1[['helpfulness', 'Gender']]
Mb_y_test = test_c1['rating']

In [21]:
# Build Model
reg = linear_model.LinearRegression()
# Train the model
reg.fit(X_train, y_train)

In [22]:
#predict for test data
model_b_predicted = reg.predict(X_test)

## Model-c

In [23]:
# Define input variable
X_train = train_c2[['Category','Review']]
y_train = train_c2['rating']

X_test = test_c2[['Category','Review']]
Mc_y_test = test_c2['rating']

In [24]:
# Build Model
reg = linear_model.LinearRegression()
# Train the model
reg.fit(X_train, y_train)

In [25]:
#predict for test data
model_c_predicted = reg.predict(X_test)

## Model-d

In [26]:
# Define input variable
X_train = train_c2[['helpfulness','Gender']]
y_train = train_c2['rating']

X_test = test_c2[['helpfulness','Gender']]
Md_y_test = test_c2['rating']

In [27]:
# Build Model
reg = linear_model.LinearRegression()
# Train the model
reg.fit(X_train, y_train)

In [28]:
#predict for test data
model_d_predicted = reg.predict(X_test)

### Evaluate Models
* Evaluate the performance of the four models with two metrics, including MSE and Root MSE
* Print the results of the four models regarding the two metrics

In [29]:
# Evaluate model_a: MSE, RMSE and R-squared on the test set
Model_a_mse = ((np.array(Ma_y_test) - model_a_predicted)**2).sum()/len(Ma_y_test)
Model_a_rmse = np.sqrt(Model_a_mse)
Model_a_r2 = r2_score(Ma_y_test, model_a_predicted)

print("MSE :", Model_a_mse)
print("RMSE :" , Model_a_rmse)
print("R Squared :", Model_a_r2)


MSE : 1.7690740179517055
RMSE : 1.3300654186737229
R Squared : 0.020578145218415278


In [30]:
# Evaluate model_b: MSE, RMSE and R-squared on the test set
Model_b_mse = ((np.array(Mb_y_test) - model_b_predicted)**2).sum()/len(Mb_y_test)
Model_b_rmse = np.sqrt(Model_b_mse)
Model_b_r2 = r2_score(Mb_y_test, model_b_predicted)

print("MSE :", Model_b_mse)
print("RMSE :" , Model_b_rmse)
print("R Squared :", Model_b_r2)

MSE : 1.8412549895856636
RMSE : 1.356928513071217
R Squared : -0.019383789895821568


In [31]:
# Evaluate model_c: MSE, RMSE and R-squared on the test set
Model_c_mse = ((np.array(Mc_y_test) - model_c_predicted)**2).sum()/len(Mc_y_test)
Model_c_rmse = np.sqrt(Model_c_mse)
Model_c_r2 = r2_score(Mc_y_test, model_c_predicted)

print("MSE :", Model_c_mse)
print("RMSE :" , Model_c_rmse)
print("R Squared :", Model_c_r2)

MSE : 1.7588975359805048
RMSE : 1.3262343442923294
R Squared : 0.022040319944943265


In [32]:
# Evaluate model_d: MSE, RMSE and R-squared on the test set
Model_d_mse = ((np.array(Md_y_test) - model_d_predicted)**2).sum()/len(Md_y_test)
Model_d_rmse = np.sqrt(Model_d_mse)
Model_d_r2 = r2_score(Md_y_test, model_d_predicted)

print("MSE :", Model_d_mse)
print("RMSE :" , Model_d_rmse)
print("R Squared :", Model_d_r2)

MSE : 1.8109460127732369
RMSE : 1.3457139416581954
R Squared : -0.006899007486201425


### Visualize, Compare and Analyze the Results
* Visualize the results, and perform ___insightful analysis___ on the obtained results. For better visualization, you may need to carefully set the scale for the y-axis.
* Normally, the model trained with the most correlated features and more training data will get better results. Do you obtain similar observations? If not, please ___explain the possible reasons___.

In [33]:
# Collect the metrics of each model in a list of dicts
data = [{'Model': 'Model_a', 'MSE': Model_a_mse, 'RMSE': Model_a_rmse, 'R2': Model_a_r2},
        {'Model': 'Model_b', 'MSE': Model_b_mse, 'RMSE': Model_b_rmse, 'R2': Model_b_r2},
        {'Model': 'Model_c', 'MSE': Model_c_mse, 'RMSE': Model_c_rmse, 'R2': Model_c_r2},
        {'Model': 'Model_d', 'MSE': Model_d_mse, 'RMSE': Model_d_rmse, 'R2': Model_d_r2}]
# Create the DataFrame
df = pd.DataFrame(data)

# Display the results
df

Unnamed: 0,Model,MSE,RMSE,R2
0,Model_a,1.769074,1.330065,0.020578
1,Model_b,1.841255,1.356929,-0.019384
2,Model_c,1.758898,1.326234,0.02204
3,Model_d,1.810946,1.345714,-0.006899
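
The table can also be shown as bar charts. The following is a minimal sketch, assuming the MSE and RMSE values are copied by hand from the table above (in the notebook, the corresponding columns of the results DataFrame could be passed instead). The y-axis is zoomed in because the four models differ only in the second decimal place.

```python
import matplotlib.pyplot as plt

# Values copied from the results table above (rounded as reported there).
models = ["Model_a", "Model_b", "Model_c", "Model_d"]
mse_vals = [1.769074, 1.841255, 1.758898, 1.810946]
rmse_vals = [1.330065, 1.356929, 1.326234, 1.345714]
# Blue: two most correlated features; orange: two least correlated features.
colors = ["tab:blue", "tab:orange", "tab:blue", "tab:orange"]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, vals, name in zip(axes, [mse_vals, rmse_vals], ["MSE", "RMSE"]):
    ax.bar(models, vals, color=colors)
    # Zoom the y-axis around the data so the small gaps stay visible.
    pad = 0.5 * (max(vals) - min(vals))
    ax.set_ylim(min(vals) - pad, max(vals) + pad)
    ax.set_ylabel(name)
    ax.set_title(f"{name} on the test set")
fig.tight_layout()
```

In a notebook with `%matplotlib inline` the figure renders at the end of the cell; in a script, `plt.show()` would be needed. A truncated y-axis exaggerates the differences visually, so the actual values should still be read from the axis.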


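**In this experiment the expected pattern does appear, but only weakly. With the same training size, the models using the two most correlated features (category and review) have lower errors than those using the two least correlated features (helpfulness and gender): MSE 1.769 for model-a against 1.841 for model-b, and 1.759 for model-c against 1.811 for model-d. With the same features, training on 90% of the data instead of 10% also lowers the error slightly (model-a 1.769 to model-c 1.759, model-b 1.841 to model-d 1.811).**

**The gains are small, and every R-squared value is close to zero (negative for model-b and model-d), so none of the models predicts rating much better than always predicting the mean. Possible reasons are that all four features are only weakly correlated with rating (the largest absolute coefficient is about 0.16), that category and review are unordered labels whose ordinal codes carry no meaningful order for a linear model, and that cases 1 and 2 are evaluated on test sets of different sizes from a single random split, so part of the difference may be sampling noise. More training data and more correlated features usually improve performance, but this is not guaranteed; repeating the comparison with cross-validation and better feature engineering (for example, one-hot encoding of category) would give a more reliable conclusion.**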
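
Since cross-validation is mentioned as a way to make the comparison more reliable, here is a minimal sketch of 5-fold cross-validated MSE for a linear regression. The data is synthetic and the names `X_demo` and `y_demo` are assumptions for illustration; it is not a re-run of the experiments above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: only the first feature is related to the target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 + 0.5 * X_demo[:, 0] + rng.normal(size=300)

# scikit-learn reports negated MSE so that larger scores are always better.
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -neg_mse

print("MSE per fold:", np.round(cv_mse, 3))
print("mean:", round(float(cv_mse.mean()), 3), "std:", round(float(cv_mse.std()), 3))
```

The spread across folds gives a rough idea of how much of a difference between two models could be explained by the choice of split alone.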