## Analysis of an E-commerce Dataset Part 2

The goal of the second analysis task is to train linear regression models to predict users' ratings towards items. This involves a standard Data Science workflow: exploring data, building models, making predictions, and evaluating results. In this task, we will explore the impacts of feature selections and different sizes of training/testing data on the model performance. We will use another cleaned combined e-commerce sub-dataset that **is different from** the one in “Analysis of an E-commerce Dataset” task 1.

### Import Cleaned E-commerce Dataset
The csv file named 'cleaned_ecommerce_dataset.csv' is provided. You may need to use the Pandas method, i.e., `read_csv`, for reading it. After that, please print out its total length.

In [113]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.metrics import r2_score
import matplotlib.pylab as plt
%matplotlib inline
#importing Encoder
from sklearn.preprocessing import OrdinalEncoder
encoder= OrdinalEncoder()
#importing train_test_split
from sklearn.model_selection import train_test_split

#reading the dataset
df = pd.read_csv('Cleaned_Ecommerce_Dataset.csv')
#printing the original dataset
print("Original length of data set : ",len (df))

Original length of data set :  2685


### Explore the Dataset

* Use the methods, i.e., `head()` and `info()`, to have a rough picture about the data, e.g., how many columns, and the data types of each column.
* As our goal is to predict ratings given other columns, please get the correlations between helpfulness/gender/category/review and rating by using the `corr()` method.

  Hints: To get the correlations between different features, you may need to first convert the categorical features (i.e., gender, category and review) into numerial values. For doing this, you may need to import `OrdinalEncoder` from `sklearn.preprocessing` (refer to the useful exmaples [here](https://pbpython.com/categorical-encoding.html))
* Please provide ___necessary explanations/analysis___ on the correlations, and figure out which are the ___most___ and ___least___ corrleated features regarding rating. Try to ___discuss___ how the correlation will affect the final prediction results, if we use these features to train a regression model for rating prediction. In what follows, we will conduct experiments to verify your hypothesis.

In [3]:
#using head function
print(df.head())

   userId  timestamp                                           review  \
0    4081      71900                                Not always McCrap   
1    4081      72000  I dropped the chalupa even before he told me to   
2    4081      72000                     The Wonderful World of Wendy   
3    4081     100399                             They actually did it   
4    4081     100399                             Hey! Gimme some pie!   

                                 item  rating  helpfulness gender  \
0                          McDonald's     4.0          3.0      M   
1                           Taco Bell     1.0          4.0      M   
2                             Wendy's     5.0          4.0      M   
3  South Park: Bigger, Longer & Uncut     5.0          3.0      M   
4                        American Pie     3.0          3.0      M   

                category  item_id  item_price  user_city  
0  Restaurants & Gourmet       41       30.74          4  
1  Restaurants & Gourmet    

In [27]:
#using info function 
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2685 entries, 0 to 2684
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2685 non-null   int64  
 1   timestamp    2685 non-null   int64  
 2   review       2685 non-null   object 
 3   item         2685 non-null   object 
 4   rating       2685 non-null   float64
 5   helpfulness  2685 non-null   float64
 6   gender       2685 non-null   object 
 7   category     2685 non-null   object 
 8   item_id      2685 non-null   int64  
 9   item_price   2685 non-null   float64
 10  user_city    2685 non-null   int64  
dtypes: float64(3), int64(4), object(4)
memory usage: 230.9+ KB
None


In [43]:

#converting categories into numercial values to perform the correlation 

df = pd.read_csv("Cleaned_Ecommerce_Dataset.csv",header=None, names=headers, na_values="?" )

#converting it to Nan data type-
df.dtypes
obj_df = df.select_dtypes(include=['object']).copy()
obj_df[obj_df.isnull().any(axis=1)]
obj_df.head()

Unnamed: 0,userId,timestamp,review,item,rating,helpfulness,gender,category,item_id,item_price,user_city
0,userId,timestamp,review,item,rating,helpfulness,gender,category,item_id,item_price,user_city
1,4081,71900,Not always McCrap,McDonald's,4.0,3.0,M,Restaurants & Gourmet,41,30.74,4
2,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,1.0,4.0,M,Restaurants & Gourmet,74,108.3,4
3,4081,72000,The Wonderful World of Wendy,Wendy's,5.0,4.0,M,Restaurants & Gourmet,84,69.0,4
4,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",5.0,3.0,M,Movies,68,143.11,4


In [21]:
#establishing correlation between helpfulness and rating 

c1= df['helpfulness']. corr(df['rating'])
print("Correlation between helpfulness and ratings : ", c1)


Correlation between helpfulness and ratings :  -0.007523337726844546


In [47]:
#Converting gender column to numerical values

df['gender']= encoder.fit_transform(df[['gender']])
#establishing correlation between gender and rating 
c2= df['gender']. corr(df['rating'])
print("Correlation between helpfulness and ratings : ", c2)


Correlation between helpfulness and ratings :  -0.03433661424208265


In [48]:
#Converting gender column to numerical values

df['category']= encoder.fit_transform(df[['category']])
#establishing correlation between gender and rating 
c3= df['category']. corr(df['rating'])
print("Correlation between helpfulness and ratings : ", c3)


Correlation between helpfulness and ratings :  -0.16315765340915656


In [49]:
#Converting gender column to numerical values
df['review']= encoder.fit_transform(df[['review']])
#establishing correlation between gender and rating 
c4= df['review']. corr(df['rating'])
print("Correlation between helpfulness and ratings : ", c4)


Correlation between helpfulness and ratings :  -0.036118386552122385


In [115]:
#most and least correlated features- 

#Analysis-How correlation will affect the result of predictions:

The most correlated feature with rating will have the strongest influence on the results of the predictions.
If the correlation is positive which would mean that the rating goes higher as the value of that particular features increases and vice versa when the correlation is negative.
According to the solution below (highest and lowest two correlated features) when the helpfulness and gender diversity increases the rating go higher whereas when the category diversity and review decrease the rating go higher. 

### Split Training and Testing Data
* Machine learning models are trained to help make predictions for the future. Normally, we need to randomly split the dataset into training and testing sets, where we use the training set to train the model, and then leverage the well-trained model to make predictions on the testing set.
* To further investigate whether the size of the training/testing data affects the model performance, please random split the data into training and testing sets with different sizes:
    * Case 1: training data containing 10% of the entire data;
    * Case 2: training data containing 90% of the entire data.
* Print the shape of training and testing sets in the two cases.

In [54]:
#Splitting the data randomly into 10% and 90%
#for 10%
train_data_case1, test_data_case1= train_test_split(df, test_size= 0.9, train_size= 0.1,random_state=42)
print("Shape of training data in Case 1: ",train_data_case1.shape)
print("Shape of training data in Case 1: ",test_data_case1.shape)

Shape of training data in Case 1:  (268, 11)
Shape of training data in Case 2:  (2417, 11)


In [56]:
#for 90%

train_data_case2, test_data_case2= train_test_split(df, test_size= 0.1, train_size= 0.2,random_state=42)
print("Shape of training data in Case 1: ",train_data_case2.shape)
print("Shape of training data in Case 2: ",test_data_case2.shape)

Shape of training data in Case 1:  (537, 11)
Shape of training data in Case 2:  (269, 11)


### Train Linear Regression Models with Feature Selection under Cases 1 & 2
* When training a machine learning model for prediction, we may need to select the most important/correlated input features for more accurate results.
* To investigate whether feature selection affects the model performance, please select two most correlated features and two least correlated features from helpfulness/gender/category/review regarding rating, respectively.
* Train four linear regression models by following the conditions:
    - (model-a) using the training/testing data in case 1 with two most correlated input features
    - (model-b) using the training/testing data in case 1 with two least correlated input features
    - (model-c) using the training/testing data in case 2 with two most correlated input features
    - (model-d) using the training/testing data in case 2 with two least correlated input features
* By doing this, we can verify the impacts of the size of traing/testing data on the model performance via comparing model-a and model-c (or model-b and model-d); meanwhile the impacts of feature selection can be validated via comparing model-a and model-b (or model-c and model-d).    

In [108]:
#Train linear regression model-
#importing legit packages-
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest
#selecting columns for the model-
features = ['helpfulness','gender','category', 'review']

#converting features to Nan type
encoder = OrdinalEncoder()
df[features]= encoder.fit_transform(df[features])

#sorting the correlations wrt to rating
correlations = df[features + ['rating']].corr()['rating'].sort_values()
print(correlations)

#two-most co-related features
most_corr_features = correlations[-3:-1].index.tolist()
print("Most correlation features are : ", most_corr_features)
least_corr_features = correlations [1:3].index.tolist()
print("Least correlation features are : ",least_corr_features)



category      -0.163158
review        -0.036118
gender        -0.034337
helpfulness   -0.007523
rating         1.000000
Name: rating, dtype: float64
Most correlation features are :  ['gender', 'helpfulness']
Least correlation features are :  ['review', 'gender']


In [127]:
#Splitting the data
X_train_10, x_test_10, y_train_10, y_test_10 = train_test_split(df[most_corr_features + least_corr_features], df['rating'], test_size=0.9, random_state=42)
X_train_90, x_test_90, y_train_90, y_test_90 = train_test_split(df[most_corr_features + least_corr_features], df['rating'], test_size=0.1, random_state=42)


In [125]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply LabelEncoder to each column
for feature in features:
    df[feature] = label_encoder.fit_transform(df[feature])



In [146]:
#Creating models-
models = {}
#modelA
model_a = LinearRegression()
model_a.fit(X_train_10[most_corr_features],y_train_10)
models ['ModelA']= {'model':model_a,'X_test':X_test_10[most_corr_features], 'y_test': y_test_10}

#modelB
model_b = LinearRegression()
models ['ModelB']= {'model':model_b,'X_test':X_test_10[least_corr_features], 'y_test': y_test_10}

#modelC
model_c = LinearRegression()
model_c.fit(X_train_90[most_corr_features],y_train_90)
models ['ModelC']= {'model':model_c,'X_test':X_test_90[most_corr_features], 'y_test': y_test_90}

#modelD
model_d = LinearRegression()
model_d.fit(X_train_90[most_corr_features],y_train_90)
models ['ModelC']= {'model':model_c,'X_test':X_test_90[least_corr_features], 'y_test': y_test_90}


### Evaluate Models
* Evaluate the performance of the four models with two metrics, including MSE and Root MSE
* Print the results of the four models regarding the two metrics

In [147]:
#Evaluating models-
#making a result libraray
results = {}


for model_name, model_data in models.items():
        model = model_data['model']
        X_test= model_data['X_test']
        Y_test= model_data['y_test']
        y_pred= model.predict(X_test)
        mse = mean_squared_error(Y_test, y_pred)
        root_mse = np.sqrt(mse)
        results[model_name]= {'MSE':mse, 'RSME': root_mse}
        
for model_name, metrics in results.items():
    print(f"{model_name} - MSE: {metrics['MSE']:.4f}, RMSE: {metrics['RMSE']:.4f}")

ValueError: could not convert string to float: 'M'

In [148]:
# Construct models dictionary
models = {}

# Model-a: Using the training/testing data in case 1 with two most correlated input features
models['ModelA'] = {'model': model_a, 'X_test': X_test_10[most_corr_features], 'y_test': y_test_10}

# Model-b: Using the training/testing data in case 1 with two least correlated input features
models['ModelB'] = {'model': model_b, 'X_test': X_test_10[least_corr_features], 'y_test': y_test_10}

# Model-c: Using the training/testing data in case 2 with two most correlated input features
models['ModelC'] = {'model': model_c, 'X_test': X_test_90[most_corr_features], 'y_test': y_test_90}

# Model-d: Using the training/testing data in case 2 with two least correlated input features
models['ModelD'] = {'model': model_d, 'X_test': X_test_90[least_corr_features], 'y_test': y_test_90}


In [161]:
#Evaluating the performance of four models including MSE and root MSE
from sklearn.metrics import mean_squared_error
import numpy as np

# Evaluate models
results = {}

for model_name, model_data in models.items():
    model = model_data['model']
    X_test = model_data['X_test']
    y_test = model_data['y_test']
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    results[model_name] = {'MSE': mse, 'RMSE': rmse}

# Print results
for model_name, metrics in results.items():
    print(results)

ValueError: could not convert string to float: 'M'

### Visualize, Compare and Analyze the Results
* Visulize the results, and perform ___insightful analysis___ on the obtained results. For better visualization, you may need to carefully set the scale for the y-axis.
* Normally, the model trained with most correlated features and more training data will get better results. Do you obtain the similar observations? If not, please ___explain the possible reasons___.

In [159]:
import matplotlib.pyplot as plt

# Create bar plot for RMSE
plt.figure(figsize=(10, 6))
plt.bar(model_names, rmse, color='r', alpha=0.7)
plt.xlabel('Models')
plt.ylabel('RMSE')
plt.title('Root Mean Squared Error (RMSE) for Different Models')
plt.show()

# Create bar plot for MSE
plt.figure(figsize=(10, 6))
plt.bar(model_names, mse, color='r', alpha=0.7)
plt.xlabel('Models')
plt.ylabel('MSE')
plt.title('Root Mean Squared Error (RMSE) for Different Models')
plt.show()



NameError: name 'rmse' is not defined

<Figure size 1000x600 with 0 Axes>

Explanation - Whether the model trained with most correlated features and more training data will get better results.

Yes, similar observations were obtained because training the methods results in picking the most relevant data hence the results obtained are refined. We can also solve more compex problems by using this as tarining makes the model more effective. It also enhances the features on the model and ensures that the model is correlated with the most-relevant features only.



### Data Science Ethics
*Please read the following examples [Click here to read the example_1.](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) [Click here to read the example_2.](https://viborc.com/ethics-and-ethical-data-visualization-a-complete-guide/)

*Then view the picture ![My Image](figure_portfolio2.png "This is my image")
Please compose an analysis of 100-200 words that evaluates potential ethical concerns associated with the infographic, detailing the reasons behind these issues.


Potential ethical concerns associated with infographic -

In the two photos what it seems like is that only the gold colum is sorted without putting filters on the columns which is china which has won the highest gold medals is on top. However because the other columns are not sorted which is why the numbers presented in the other columns are incorrect. It seems as if China won 21 Silver and 28 medals which is incorrect. For other countries the sum. of total number of medals doesn't match because of incorrect sorting of data.