In [134]:
import pandas as pd
import numpy as np

## Making a .csv file 

Instead of providing a ready-made .csv file, I have included code that writes the file from the movie data, so a fresh copy is generated whenever the code runs. The data is stored in a dictionary, loaded into a pandas DataFrame, and then written out as a .csv file.

In [135]:
#Movie ratings dictionary to put into a .csv file
data = {
    "User": ["John", "Logan", "Modesto", "Malcolm", "Maurice"],
    "American Sniper": [5, 4, None, None, 5],
    "Edge of Tomorrow": [4, None, None, None, 4],
    "Groundhog Day": [3, 3, 4, 2, 4],
    "Jurassic World": [None, 3, None, None, 2],
    "Lost in Translation": [None, None, 4, 4, 3],
    "Lucy": [4, None, 4, None, 3]
}

#Making a .csv file with the data
df = pd.DataFrame(data)
df.to_csv("movie_ratings.csv", index=False)

__________________________________________________________________________________________

## The .csv file is loaded into a pandas DataFrame

This code reverses the previous step. The .csv file is read back into a pandas DataFrame with the User column as the index, and all rating columns are converted to the same data type (float).
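As a minimal sketch of this round trip, the example below writes a small table to an in-memory buffer and reads it back. The two-user `sample` table and the use of `io.StringIO` in place of a file on disk are assumptions made for illustration only. It shows why the cast to float is useful: a column with no missing values (Groundhog Day in the data above) is read back as integers, while columns with gaps are already floats.

```python
import io

import pandas as pd

# Hypothetical two-user subset of the ratings, used only for illustration.
sample = pd.DataFrame({
    "User": ["John", "Logan"],
    "American Sniper": [5, 4],       # no missing values
    "Jurassic World": [None, 3],     # one missing value
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# index_col=0 turns the first column ("User") into the row index.
raw = pd.read_csv(buffer, index_col=0)
print(raw.dtypes)

# After the cast every rating column has the same dtype (float64),
# and the missing rating is represented as NaN.
loaded = raw.astype(float)
print(loaded.dtypes)
print(loaded)
```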

In [136]:
#1

#Loads the .csv file we made into a pandas dataframe
movie_df = pd.read_csv("movie_ratings.csv", index_col=0)

#Converting all numerical values to float
movie_df = movie_df.astype(float)
display(movie_df)

User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,5.0,4.0,3.0,,,4.0
Logan,4.0,,3.0,3.0,,
Modesto,,,4.0,,4.0,4.0
Malcolm,,,2.0,,4.0,
Maurice,5.0,4.0,4.0,2.0,3.0,3.0


__________________________________________________________________________________________

## Showing the average ratings for each user and each movie

This code will show both the average ratings for each user and each movie by calculating the mean of every row and column, respectively. 
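The sketch below illustrates the two directions on a small table. The 2x3 `ratings` frame and its labels are hypothetical. It relies on `DataFrame.mean`, which skips NaN by default, so a missing rating is left out of both the sum and the count instead of being counted as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 ratings table; np.nan marks a missing rating.
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0],
        "Movie B": [np.nan, 2.0],
        "Movie C": [3.0, np.nan],
    },
    index=["User 1", "User 2"],
)

# axis=0 (the default) collapses the rows: one mean per column (per movie).
per_movie = ratings.mean(axis=0)

# axis=1 collapses the columns: one mean per row (per user).
per_user = ratings.mean(axis=1)

print(per_movie)
print(per_user)
```

Here User 1's average is (5 + 3) / 2 = 4.0, because the missing rating for Movie B is ignored.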

In [137]:
#2

#Showing the average user ratings for each movie by calculating the mean per column ignoring NaNs
movie_averages = movie_df.agg(np.mean)
#Showing the average movie ratings for each user by calculating the mean per row
user_averages = movie_df.agg(np.mean, axis=1)

print(f"These are the average user ratings for each movie:\n{movie_averages}\n")
print(f"These are the average movie ratings for each user:\n{user_averages}")

These are the average user ratings for each movie:
American Sniper        4.666667
Edge of Tomorrow       4.000000
Groundhog Day          3.200000
Jurassic World         2.500000
Lost in Translation    3.666667
Lucy                   3.666667
dtype: float64

These are the average movie ratings for each user:
User
John       4.000000
Logan      3.333333
Modesto    4.000000
Malcolm    3.000000
Maurice    3.500000
dtype: float64


__________________________________________________________________________________________

## Showing the normalized ratings for each user and each movie

To normalize the ratings, each set of values (one movie's column or one user's row) is rescaled to the range 0 to 1. This can be done using min-max normalization, which follows the formula: **normalized value = (x − min(x)) / (max(x) − min(x))**

The average of the normalized ratings is then shown for each user and each movie. Where all known ratings in a row or column are equal, max(x) − min(x) is 0 and the result is NaN.
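A minimal sketch of the formula, applied to two series taken from the data above: Maurice's six ratings, and the Edge of Tomorrow column, whose two known ratings are both 4. The helper name `min_max` is an assumption for illustration and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def min_max(values: pd.Series) -> pd.Series:
    """Rescale a Series to the range 0 to 1; missing values stay missing."""
    span = values.max() - values.min()
    return (values - values.min()) / span


# Maurice's ratings: min is 2, max is 5, so the span is 3.
maurice = pd.Series([5.0, 4.0, 4.0, 2.0, 3.0, 3.0])
maurice_scaled = min_max(maurice)
print(maurice_scaled.round(6).tolist())

# Edge of Tomorrow: both known ratings are 4, so the span is 0
# and every entry becomes NaN (0 / 0 or NaN / 0).
edge = pd.Series([4.0, np.nan, np.nan, np.nan, 4.0])
edge_scaled = min_max(edge)
print(edge_scaled.tolist())
```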

In [138]:
#3

#Normalize the ratings for each movie
normalized_df_movie = movie_df.transform(lambda x: (x - x.min()) / (x.max() - x.min()), axis=0)
print("\nThe normalized ratings for each movie: ")
display(normalized_df_movie)

#Normalize the ratings for each user
normalized_df_user = movie_df.transform(lambda x: (x - x.min()) / (x.max() - x.min()), axis=1)
print("\nThe normalized ratings for each user: ")
display(normalized_df_user)

#Storing the normalized movie averages in a DataFrame
normalized_movie = normalized_df_movie.agg(np.mean)
normalized_df_movie = pd.DataFrame(normalized_movie)
normalized_df_movie.columns = ['Normalized Average Rating']
normalized_df_movie.index.name = 'Movie'
print("\nThe normalized average ratings for each movie: ")
display(normalized_df_movie)

#Storing the normalized user averages in a DataFrame
normalized_user = normalized_df_user.agg(np.mean, axis=1)
normalized_df_user = pd.DataFrame(normalized_user)
normalized_df_user.columns = ['Normalized Average Rating']
print("\nThe normalized average ratings for each user: ")
display(normalized_df_user)


The normalized ratings for each movie: 


User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,1.0,,0.5,,,1.0
Logan,0.0,,0.5,1.0,,
Modesto,,,1.0,,1.0,1.0
Malcolm,,,0.0,,1.0,
Maurice,1.0,,1.0,0.0,0.0,0.0



The normalized ratings for each user: 


User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,1.0,0.5,0.0,,,0.5
Logan,1.0,,0.0,0.0,,
Modesto,,,,,,
Malcolm,,,0.0,,1.0,
Maurice,1.0,0.666667,0.666667,0.0,0.333333,0.333333



The normalized average ratings for each movie: 


Movie,Normalized Average Rating
American Sniper,0.666667
Edge of Tomorrow,
Groundhog Day,0.6
Jurassic World,0.5
Lost in Translation,0.666667
Lucy,0.666667



The normalized average ratings for each user: 


User,Normalized Average Rating
John,0.5
Logan,0.333333
Modesto,
Malcolm,0.5
Maurice,0.5


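Alternatively, the NaN values can be handled first. One way to do this is to fill in the missing ratings with averages in a new DataFrame: movie_df_filled uses each movie's average and user_df_filled uses each user's average. Afterwards, the normalization process is the same. Note that the per-movie normalization below uses user_df_filled and the per-user normalization uses movie_df_filled. Filling and normalizing along the same axis would leave Edge of Tomorrow and Modesto, whose known ratings are all 4, constant, and their normalized values would still be NaN.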
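The two fill directions can be sketched on a small table. The 3x3 `ratings` frame is hypothetical, and `fillna` with a Series of column means is used here as a shorthand that should be equivalent to a column-wise `apply`. The example also shows the degenerate case: a movie whose known ratings are all equal stays constant after a movie-average fill.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt in the same layout: users as rows, movies as columns.
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, np.nan, 3.0],
        "Movie B": [4.0, 4.0, np.nan],
        "Movie C": [np.nan, 2.0, 1.0],
    },
    index=["User 1", "User 2", "User 3"],
)

# Fill each gap with the mean of its column (the movie's average rating).
filled_by_movie = ratings.fillna(ratings.mean(axis=0))

# Fill each gap with the mean of its row (the user's average rating).
filled_by_user = ratings.apply(lambda row: row.fillna(row.mean()), axis=1)

print(filled_by_movie)
print(filled_by_user)

# Movie B has only one distinct known rating, so the movie-average fill
# leaves the column constant; min-max scaling per movie would divide by 0.
print(filled_by_movie["Movie B"].nunique())
print(filled_by_user["Movie B"].nunique())
```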

In [139]:
#Optional: handle the missing values by filling them with averages
#Fill each missing rating with that movie's average (column mean)
movie_df_filled = movie_df.apply(lambda x: x.fillna(x.mean()), axis=0)
#Fill each missing rating with that user's average (row mean)
user_df_filled = movie_df.apply(lambda x: x.fillna(x.mean()), axis=1)


#Normalize the ratings for each movie without missing values
normalized_df_movie_filled = user_df_filled.transform(lambda x: (x - x.min()) / (x.max() - x.min()), axis=0)
print("\nThe normalized ratings for each movie without missing values: ")
display(normalized_df_movie_filled)

#Normalize the ratings for each user without missing values
normalized_df_user_filled = movie_df_filled.transform(lambda x: (x - x.min()) / (x.max() - x.min()), axis=1)
print("\nThe normalized ratings for each user without missing values: ")
display(normalized_df_user_filled)

#Storing the normalized movie averages in a DataFrame
normalized_movie_filled = normalized_df_movie_filled.agg(np.mean)
normalized_df_movie_filled = pd.DataFrame(normalized_movie_filled)
normalized_df_movie_filled.index.name = 'Movie'
normalized_df_movie_filled.columns = ['Normalized Average Rating']

print("\nThe normalized average ratings for each movie without missing values: ")
display(normalized_df_movie_filled)

#Storing the normalized user averages in a DataFrame
normalized_user_filled = normalized_df_user_filled.agg(np.mean, axis=1)
normalized_df_user_filled = pd.DataFrame(normalized_user_filled)
normalized_df_user_filled.columns = ['Normalized Average Rating']

print("\nThe normalized average ratings for each user without missing values: ")
display(normalized_df_user_filled)


The normalized ratings for each movie without missing values: 


User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,1.0,1.0,0.5,1.0,1.0,1.0
Logan,0.5,0.333333,0.5,0.5,0.333333,0.333333
Modesto,0.5,1.0,1.0,1.0,1.0,1.0
Malcolm,0.0,0.0,0.0,0.5,1.0,0.0
Maurice,1.0,1.0,1.0,0.0,0.0,0.0



The normalized ratings for each user without missing values: 


User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,1.0,0.6,0.2,0.0,0.466667,0.6
Logan,1.0,1.0,0.0,0.0,0.666667,0.666667
Modesto,1.0,0.692308,0.692308,0.0,0.692308,0.692308
Malcolm,1.0,0.75,0.0,0.1875,0.75,0.625
Maurice,1.0,0.666667,0.666667,0.0,0.333333,0.333333



The normalized average ratings for each movie without missing values: 


Movie,Normalized Average Rating
American Sniper,0.6
Edge of Tomorrow,0.666667
Groundhog Day,0.6
Jurassic World,0.6
Lost in Translation,0.666667
Lucy,0.466667



The normalized average ratings for each user without missing values: 


User,Normalized Average Rating
John,0.477778
Logan,0.555556
Modesto,0.628205
Malcolm,0.552083
Maurice,0.5


__________________________________________________________________________________________

## Showing the standardized ratings for each user and each movie

To standardize the ratings, each set of values (one movie's column or one user's row) is transformed to have zero mean and unit variance. This can be done with the formula: **standardized value = (x − mean(x)) / std(x)**

The average of the standardized ratings is then shown for each user and each movie. Because standardizing centers every row or column on zero, these averages are zero up to floating-point round-off (values on the order of 1e-16).

For this process, I will handle the NaN values as well, again by filling in the missing ratings with movie or user averages in a new DataFrame. Afterwards, the standardization process is the same.
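A minimal sketch of the formula on Maurice's ratings, which contain no missing values. The helper name `z_score` and its `ddof` parameter are assumptions for illustration. One detail worth noting is that pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`), so the two give slightly different standardized values on small samples.

```python
import numpy as np
import pandas as pd


def z_score(values: pd.Series, ddof: int = 1) -> pd.Series:
    """Center on the mean and divide by the standard deviation."""
    return (values - values.mean()) / values.std(ddof=ddof)


maurice = pd.Series([5.0, 4.0, 4.0, 2.0, 3.0, 3.0])

# Sample standard deviation, matching the pandas default.
sample_z = z_score(maurice)

# Population standard deviation, matching the NumPy default.
population_z = z_score(maurice, ddof=0)

print(sample_z.round(6).tolist())
print(population_z.round(6).tolist())

# The mean is zero only up to floating-point round-off,
# so compare with a tolerance rather than with == 0.
print(np.isclose(sample_z.mean(), 0.0, atol=1e-12))
```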

In [140]:
#Handle the missing values by filling them with averages
#Fill each missing rating with that movie's average (column mean)
movie_df_filled = movie_df.apply(lambda x: x.fillna(x.mean()), axis=0)
#Fill each missing rating with that user's average (row mean)
user_df_filled = movie_df.apply(lambda x: x.fillna(x.mean()), axis=1)

#Standardize the ratings for each movie without missing values
standardized_df_movie_filled = user_df_filled.transform(lambda x: (x - x.mean()) / x.std(), axis=0)
print("\nThe standardized ratings for each movie without missing values: ")
display(standardized_df_movie_filled)

#Standardize the ratings for each user without missing values
standardized_df_user_filled = movie_df_filled.transform(lambda x: (x - x.mean()) / x.std(), axis=1)
print("\nThe standardized ratings for each user without missing values: ")
display(standardized_df_user_filled)

#Storing the standardized movie averages in a DataFrame
standardized_movie_filled = standardized_df_movie_filled.agg(np.mean)
standardized_df_movie_filled = pd.DataFrame(standardized_movie_filled)
standardized_df_movie_filled.index.name = 'Movie'
standardized_df_movie_filled.columns = ['Standardized Average Rating']

print("\nThe standardized average ratings for each movie without missing values: ")
display(standardized_df_movie_filled)

#Storing the standardized user averages in a DataFrame
standardized_user_filled = standardized_df_user_filled.agg(np.mean, axis=1)
standardized_df_user_filled = pd.DataFrame(standardized_user_filled)
standardized_df_user_filled.columns = ['Standardized Average Rating']

print("\nThe standardized average ratings for each user without missing values: ")
display(standardized_df_user_filled)


The standardized ratings for each movie without missing values: 


User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,0.956183,0.707107,-0.239046,0.956183,0.707107,1.055009
Logan,-0.239046,-0.707107,-0.239046,-0.239046,-0.707107,-0.263752
Modesto,-0.239046,0.707107,0.956183,0.956183,0.707107,1.055009
Malcolm,-1.434274,-1.414214,-1.434274,-0.239046,0.707107,-0.923133
Maurice,0.956183,0.707107,0.956183,-1.434274,-1.414214,-0.923133



The standardized ratings for each user without missing values: 


User,American Sniper,Edge of Tomorrow,Groundhog Day,Jurassic World,Lost in Translation,Lucy
John,1.497393,0.350454,-0.796485,-1.369955,-0.031859,0.350454
Logan,0.9759,0.9759,-1.219875,-1.219875,0.243975,0.243975
Modesto,1.121708,0.193398,0.193398,-1.8953,0.193398,0.193398
Malcolm,1.178724,0.520831,-1.452846,-0.959426,0.520831,0.191885
Maurice,1.430194,0.476731,0.476731,-1.430194,-0.476731,-0.476731



The standardized average ratings for each movie without missing values: 


Movie,Standardized Average Rating
American Sniper,-1.776357e-16
Edge of Tomorrow,-5.329071e-16
Groundhog Day,-1.998401e-16
Jurassic World,-1.776357e-16
Lost in Translation,-5.329071e-16
Lucy,-1.065814e-15



The standardized average ratings for each user without missing values: 


User,Standardized Average Rating
John,-2.775558e-16
Logan,-6.846375e-16
Modesto,-1.850372e-17
Malcolm,-5.412337e-16
Maurice,1.850372e-17


__________________________________________________________________________________________

## Conclusion

Using normalized ratings instead of actual ratings has both advantages and disadvantages, depending on the context. 

**Advantages:** Normalization puts ratings from different users on a common scale, which makes comparisons easier, especially when users have different rating tendencies. It can also improve algorithm performance in machine learning and recommendation systems by preventing any one rating scale from dominating. Normalization can also reduce the impact of extreme ratings, which supports more reliable analyses, and it provides consistency when comparing ratings across different datasets or groups.

**Disadvantages:** Normalized values lose the original meaning of the ratings, so they don’t carry the same interpretive weight. Normalization can also oversimplify important differences between users or movies, potentially masking meaningful distinctions in behavior. Different normalization techniques can lead to inconsistent results, and handling missing ratings before normalization might introduce biases that misrepresent user preferences.
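Both sides of this trade-off can be seen in a small sketch. The two users and their ratings are hypothetical, and `min_max` is an illustrative helper. After min-max normalization the two rating profiles are identical, which makes their relative preferences directly comparable, but the fact that one user rates every movie two points higher is no longer visible.

```python
import pandas as pd


def min_max(values: pd.Series) -> pd.Series:
    """Rescale a Series to the range 0 to 1."""
    return (values - values.min()) / (values.max() - values.min())


movies = ["Movie A", "Movie B", "Movie C"]

# Two hypothetical users with the same preferences but different tendencies.
harsh = pd.Series([1.0, 2.0, 3.0], index=movies)
generous = pd.Series([3.0, 4.0, 5.0], index=movies)

# Same relative ordering and spacing, so the normalized profiles match.
print(min_max(harsh).tolist())     # [0.0, 0.5, 1.0]
print(min_max(generous).tolist())  # [0.0, 0.5, 1.0]

# The original averages differ, and that information is lost.
print(harsh.mean(), generous.mean())  # 2.0 4.0
```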

In conclusion, while normalized ratings offer consistency and are useful for comparative analysis, they can also obscure original meanings and oversimplify user behaviors, which may be critical in certain contexts.