
## Analysis of an E-commerce Dataset Part 2

The goal of the second analysis task is to train linear regression models to predict users' ratings of items. This involves a standard Data Science workflow: exploring data, building models, making predictions, and evaluating results. In this task, we will explore how feature selection and the size of the training/testing data affect model performance. We will use another cleaned combined e-commerce sub-dataset that **is different from** the one in “Analysis of an E-commerce Dataset” task 1.

### Import Cleaned E-commerce Dataset
The csv file named 'cleaned_ecommerce_dataset.csv' is provided. You may need to use the Pandas method, i.e., `read_csv`, for reading it. After that, please print out its total length.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
exe_df = pd.read_csv('/content/drive/MyDrive/Portfolio 2/cleaned_ecommerce_dataset.csv')

In [4]:
len(exe_df)

2685

### Explore the Dataset

* Use the methods, i.e., `head()` and `info()`, to have a rough picture about the data, e.g., how many columns, and the data types of each column.
* As our goal is to predict ratings given other columns, please get the correlations between helpfulness/gender/category/review and rating by using the `corr()` method.

  Hints: To get the correlations between different features, you may need to first convert the categorical features (i.e., gender, category and review) into numerical values. To do this, you may need to import `OrdinalEncoder` from `sklearn.preprocessing` (refer to the useful examples [here](https://pbpython.com/categorical-encoding.html)).
* Please provide ___necessary explanations/analysis___ of the correlations, and identify the ___most___ and ___least___ correlated features with respect to rating. Try to ___discuss___ how the correlation will affect the final prediction results if we use these features to train a regression model for rating prediction. In what follows, we will conduct experiments to verify your hypothesis.
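
The encode-then-correlate step described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (`toy_df` and its values are made up for the example, not taken from the dataset); it assumes the same column naming pattern (`"<name> code"`) used later in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the real dataset's values will differ.
toy_df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "category": ["Books", "Games", "Movies", "Books", "Games", "Movies"],
    "helpfulness": [4, 3, 4, 2, 3, 4],
    "rating": [5, 3, 4, 2, 4, 1],
})

# Encode the categorical columns as numeric codes, one new column per feature.
categorical_cols = ["gender", "category"]
encoded = OrdinalEncoder().fit_transform(toy_df[categorical_cols])
for i, col in enumerate(categorical_cols):
    toy_df[f"{col} code"] = encoded[:, i]

# Pearson correlation of each numeric feature with the target.
feature_cols = ["helpfulness"] + [f"{col} code" for col in categorical_cols]
corr_with_rating = toy_df[feature_cols + ["rating"]].corr()["rating"].drop("rating")

# Rank by absolute value: a strong negative correlation still counts as strong.
ranked = corr_with_rating.abs().sort_values(ascending=False)
print(corr_with_rating)
print("Most correlated:", ranked.index[0])
print("Least correlated:", ranked.index[-1])
```

Note that `OrdinalEncoder` assigns codes in the sorted order of the category labels, so for nominal features such as category the sign and size of the correlation depend on that arbitrary ordering and should be interpreted with care.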

In [None]:
exe_df.head()

In [None]:
exe_df.info()

In [7]:
from sklearn.preprocessing import OrdinalEncoder

In [12]:
ord_enc = OrdinalEncoder()
exe_df["gender code"] = ord_enc.fit_transform(exe_df[["gender"]])

In [11]:
exe_df["category code"] = ord_enc.fit_transform(exe_df[["category"]])
exe_df["review code"] = ord_enc.fit_transform(exe_df[["review"]])

In [None]:
corr_helpfulness_against_rating = exe_df['rating'].corr(exe_df['helpfulness'])
print("The correlation between helpfulness and rating:", corr_helpfulness_against_rating)

In [None]:
# 1) Most Correlated Feature:
# Category: It has the largest correlation in magnitude (-0.163158) with the rating, and it is negative.
# As the category code increases, the rating tends to decrease.

# 2) Least Correlated Feature(s):
# Timestamp and Helpfulness: These features have very low correlations (0.000369 and -0.007523 respectively) with the rating.
# This suggests that these features might not be significant predictors of the rating.

In [None]:
# Impact on Prediction Results:
# - Category: Given its relatively high correlation, it could be an important feature for predicting ratings.
# A regression model may assign more weight to this feature during training, potentially leading to better prediction performance.

# - Timestamp and Helpfulness: Since these features have low correlations with the rating,
# including them in the regression model may not significantly improve prediction accuracy.
# The model might not rely heavily on these features for making predictions.

### Split Training and Testing Data
* Machine learning models are trained to help make predictions for the future. Normally, we need to randomly split the dataset into training and testing sets, where we use the training set to train the model, and then leverage the well-trained model to make predictions on the testing set.
* To further investigate whether the size of the training/testing data affects the model performance, please randomly split the data into training and testing sets with different sizes:
    * Case 1: training data containing 10% of the entire data;
    * Case 2: training data containing 90% of the entire data.
* Print the shape of training and testing sets in the two cases.
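
As a rough sketch of the two cases, the snippet below splits a stand-in array with the same number of rows as the dataset (2685). The array `rows` is only a placeholder for the real DataFrame; the point is that `train_test_split` takes the fraction held out for *testing*, so a 10% training set corresponds to `test_size=0.9`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 2685-row dataset; only the row count matters here.
rows = np.arange(2685)

# Case 1: 10% training data, so 90% is held out for testing.
train_10, test_10 = train_test_split(rows, test_size=0.9, random_state=42)
# Case 2: 90% training data, so 10% is held out for testing.
train_90, test_90 = train_test_split(rows, test_size=0.1, random_state=42)

print("Case 1:", train_10.shape, test_10.shape)
print("Case 2:", train_90.shape, test_90.shape)
```

Fixing `random_state` makes the split reproducible, which keeps later comparisons between models on the same case fair.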

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
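# Case 1 keeps 10% of the rows for training, Case 2 keeps 90%
# (test_size is the fraction held out for testing).
train_case1, test_case1 = train_test_split(exe_df, test_size=0.9, random_state=42)
train_case2, test_case2 = train_test_split(exe_df, test_size=0.1, random_state=42)

print("Case 1:", train_case1.shape, test_case1.shape)
print("Case 2:", train_case2.shape, test_case2.shape)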

In [None]:
X = exe_df.drop(columns=['rating'])
y = exe_df['rating']

In [None]:
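# Case 1: split features and target with 10% of the rows used for training
X_train_10, X_test_10, y_train_10, y_test_10 = train_test_split(X, y, test_size=0.9, random_state=42)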
print("Case 1: Training data containing 10% of the entire data")
print("X_train shape:", X_train_10.shape)
print("y_train shape:", y_train_10.shape)
print("X_test shape:", X_test_10.shape)
print("y_test shape:", y_test_10.shape)
print()

In [None]:
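# Case 2: split features and target with 90% of the rows used for training
X_train_90, X_test_90, y_train_90, y_test_90 = train_test_split(X, y, test_size=0.1, random_state=42)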
print("Case 2: Training data containing 90% of the entire data")
print("X_train shape:", X_train_90.shape)
print("y_train shape:", y_train_90.shape)
print("X_test shape:", X_test_90.shape)
print("y_test shape:", y_test_90.shape)

### Train Linear Regression Models with Feature Selection under Cases 1 & 2
* When training a machine learning model for prediction, we may need to select the most important/correlated input features for more accurate results.
* To investigate whether feature selection affects the model performance, please select two most correlated features and two least correlated features from helpfulness/gender/category/review regarding rating, respectively.
* Train four linear regression models by following the conditions:
    - (model-a) using the training/testing data in case 1 with two most correlated input features
    - (model-b) using the training/testing data in case 1 with two least correlated input features
    - (model-c) using the training/testing data in case 2 with two most correlated input features
    - (model-d) using the training/testing data in case 2 with two least correlated input features
* By doing this, we can verify the impact of the training/testing data size on model performance by comparing model-a and model-c (or model-b and model-d); meanwhile, the impact of feature selection can be validated by comparing model-a and model-b (or model-c and model-d).
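
The two-by-two design above (two training sizes, two feature sets) can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: `demo_df`, the column names `strong_1`, `strong_2`, `weak_1`, `weak_2`, and the coefficients are invented, and the printed numbers say nothing about the e-commerce dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: "strong_*" columns drive the target, "weak_*" columns are pure noise.
rng = np.random.default_rng(0)
n_rows = 1000
demo_df = pd.DataFrame(rng.normal(size=(n_rows, 4)),
                       columns=["strong_1", "strong_2", "weak_1", "weak_2"])
demo_df["target"] = (2.0 * demo_df["strong_1"] - 1.5 * demo_df["strong_2"]
                     + rng.normal(scale=0.5, size=n_rows))

feature_sets = {"most correlated": ["strong_1", "strong_2"],
                "least correlated": ["weak_1", "weak_2"]}
train_fractions = {"case 1 (10% train)": 0.1, "case 2 (90% train)": 0.9}

# Fit one model per (case, feature set) pair and score it on that case's test split.
results = {}
for case_name, train_fraction in train_fractions.items():
    train_df, test_df = train_test_split(demo_df, train_size=train_fraction, random_state=42)
    for set_name, cols in feature_sets.items():
        model = LinearRegression().fit(train_df[cols], train_df["target"])
        mse = mean_squared_error(test_df["target"], model.predict(test_df[cols]))
        results[(case_name, set_name)] = mse
        print(f"{case_name}, {set_name}: MSE = {mse:.3f}")
```

Each model is evaluated on the test split from its own case, so that comparisons across feature sets within a case use exactly the same held-out rows.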

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# Model a: case 1 data with the two most correlated features (column names still to be filled in)
reg_a = linear_model.LinearRegression()

X_train_a = train_case1[['', '']]
y_train_a = train_case1['rating']

X_test_a = test_case1[['', '']]
y_test_a = test_case1['rating']

reg_a.fit(X_train_a, y_train_a)

In [None]:
reg_b = linear_model.LinearRegression()

X_train_b = train_case1[['', '']]
y_train_b = train_case1['rating']

X_test_b = test_case1[['', '']]
y_test_b = test_case1['rating']

reg_b.fit(X_train_b, y_train_b)

In [None]:
reg_c = linear_model.LinearRegression()

X_train_c = train_case2[['', '']]
y_train_c = train_case2['rating']

X_test_c = test_case2[['', '']]
y_test_c = test_case2['rating']

reg_c.fit(X_train_c, y_train_c)

In [None]:
reg_d = linear_model.LinearRegression()

X_train_d = train_case2[['', '']]
y_train_d = train_case2['rating']

X_test_d = test_case2[['', '']]
y_test_d = test_case2['rating']

reg_d.fit(X_train_d, y_train_d)

### Evaluate Models
* Evaluate the performance of the four models with two metrics, including MSE and Root MSE
* Print the results of the four models regarding the two metrics
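
For reference, MSE is the mean of the squared differences between true and predicted values, and Root MSE is its square root, which puts the error back in the units of the rating. A small worked example with made-up numbers (`y_true` and `y_pred` are illustrative only):

```python
import numpy as np

# Made-up true and predicted ratings for three items.
y_true = np.array([3.0, 4.0, 5.0])
y_pred = np.array([2.0, 4.0, 7.0])

# Squared errors are 1, 0 and 4, so the mean is 5/3.
squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()
rmse = np.sqrt(mse)

print("MSE:", mse)
print("Root MSE:", rmse)
```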

In [None]:
predicted_a = reg_a.predict(X_test_a)
mse_a = ((np.array(y_test_a)-predicted_a)**2).sum()/len(y_test_a)
r2_a = r2_score(y_test_a, predicted_a)
print("MSE:", mse_a)
print("Root MSE:", np.sqrt(mse_a))
print("R Squared:", r2_a)

In [None]:
predicted_b = reg_b.predict(X_test_b)
mse_b = ((np.array(y_test_b)-predicted_b)**2).sum()/len(y_test_b)
r2_b = r2_score(y_test_b, predicted_b)
print("MSE:", mse_b)
print("Root MSE:", np.sqrt(mse_b))
print("R Squared:", r2_b)

In [None]:
predicted_c = reg_c.predict(X_test_c)
mse_c = ((np.array(y_test_c)-predicted_c)**2).sum()/len(y_test_c)
r2_c = r2_score(y_test_c, predicted_c)
print("MSE:", mse_c)
print("Root MSE:", np.sqrt(mse_c))
print("R Squared:", r2_c)

In [None]:
predicted_d = reg_d.predict(X_test_d)
mse_d = ((np.array(y_test_d)-predicted_d)**2).sum()/len(y_test_d)
r2_d = r2_score(y_test_d, predicted_d)
print("MSE:", mse_d)
print("Root MSE:", np.sqrt(mse_d))
print("R Squared:", r2_d)

### Visualize, Compare and Analyze the Results
* Visualize the results, and perform ___insightful analysis___ on the obtained results. For better visualization, you may need to carefully set the scale for the y-axis.
* Normally, the model trained with the most correlated features and more training data will get better results. Do you observe the same? If not, please ___explain the possible reasons___.
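
One possible way to set the y-axis scale is sketched below. The helper `plot_metric` and the numbers in `placeholder_mse` are hypothetical and exist only to make the example runnable; they are not results from this notebook.

```python
import matplotlib.pyplot as plt

def plot_metric(metric_by_model, metric_name, margin=0.5):
    """Bar chart of one metric per model with a y-axis zoomed to the observed range."""
    names = list(metric_by_model)
    values = [metric_by_model[name] for name in names]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(names, values)
    # Pad the observed range so small differences between models stay visible.
    span = max(values) - min(values)
    pad = span * margin if span > 0 else margin
    ax.set_ylim(min(values) - pad, max(values) + pad)
    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} by model (y-axis does not start at 0)")
    return ax

# Placeholder numbers for illustration only, not results from this notebook.
placeholder_mse = {"model-a": 1.80, "model-b": 1.85, "model-c": 1.78, "model-d": 1.84}
ax = plot_metric(placeholder_mse, "MSE")
```

A zoomed axis makes small gaps readable but also exaggerates them visually, so the title states that the axis does not start at zero.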

In [None]:
#for background and observations 2 diagrams


### Data Science Ethics
* Please read the following examples: [Click here to read the example_1.](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) [Click here to read the example_2.](https://viborc.com/ethics-and-ethical-data-visualization-a-complete-guide/)

* Then view the picture ![My Image](figure_portfolio2.png "This is my image")
Please compose an analysis of 100-200 words that evaluates potential ethical concerns associated with the infographic, detailing the reasons behind these issues.
