IS362 - Assignment 7

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movies that they have seen on a scale of 1 to 5. There should be at least one movie that not everyone has seen!

Take the results (observations) and store them somewhere (like a SQL database, or a .CSV file). Load the information into a pandas dataframe. Your solution should include Python and pandas code that accomplishes the following:
1. Load the ratings by user information that you collected into a pandas dataframe.
2. Show the average ratings for each user and each movie.
3. Create a new pandas dataframe, with normalized ratings for each user. Again, show the average ratings for each user and each movie.
4. Provide a text-based conclusion: explain what might be advantages and disadvantages of using normalized ratings instead of the actual ratings.
5. [Extra credit] Create another new pandas dataframe, with standardized ratings for each user. Once again, show the average ratings for each user and each movie.

### Python Code for Imports and Reading the Data
To begin, we will import the standard libraries needed, read in the data, and display the DataFrame.

In [1]:
# standard imports for numpy and pandas
import numpy as np
import pandas as pd

# read wide data to DataFrame, set first column as index
df = pd.read_csv('data/survey.csv', index_col=0)

# make a copy of DataFrame to preserve original import
survey = df.copy()

# view full DataFrame since dataset is small 
survey

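| | Mulan | The Invisible Man | Contagion | Avengers: Endgame | The Delta Force | An American Crime |
|---|---|---|---|---|---|---|
| Andrew | 2.0 | 4 | 3.0 | NaN | 4.0 | 4.0 |
| Joel | 2.0 | 3 | 3.0 | 5.0 | NaN | NaN |
| Justin | NaN | 3 | NaN | 5.0 | 4.0 | NaN |
| Kimberley | 1.0 | 4 | 4.0 | 5.0 | 4.0 | 5.0 |
| Kiaralys | 1.0 | 4 | NaN | 4.0 | NaN | NaN |
| Moses | 3.0 | 4 | 4.0 | 5.0 | 4.0 | NaN |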


### Average Ratings for Each User and Each Movie
Some users have a "NaN" value for a rating. This `np.nan` value represents a movie that the person has not seen. Since we don't want these values included in the average computations, we will keep them as "NaN". Replacing them, even with zero, would produce misleading averages.
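
To illustrate why the missing ratings are left as "NaN", the minimal sketch below compares the mean that pandas computes by default (skipping NaN) with the mean after filling NaN with zero. The small Series re-creates the "An American Crime" column from the survey table; the variable names are illustrative only and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# ratings for "An American Crime" copied from the survey table (NaN = not seen)
crime = pd.Series(
    [4.0, np.nan, np.nan, 5.0, np.nan, np.nan],
    index=["Andrew", "Joel", "Justin", "Kimberley", "Kiaralys", "Moses"],
)

# pandas skips NaN by default (skipna=True), so only the two real ratings count
mean_skipping_nan = crime.mean()

# filling NaN with 0 treats "not seen" as a rating of zero and drags the mean down
mean_zero_filled = crime.fillna(0).mean()

print(mean_skipping_nan)  # 4.5
print(mean_zero_filled)   # 1.5
```

With zeros filled in, a movie that two people rated 4 and 5 would appear to average 1.5, which is lower than any rating it actually received.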

In [2]:
# view average of each movie's ratings, rounded to 2 decimals
survey.mean().round(2)

Mulan                1.80
The Invisible Man    3.67
Contagion            3.50
Avengers: Endgame    4.80
The Delta Force      4.00
An American Crime    4.50
dtype: float64

In [3]:
# view average of each user's ratings, rounded to 2 decimals
survey.mean(axis=1).round(2)

Andrew       3.40
Joel         3.25
Justin       4.00
Kimberley    3.83
Kiaralys     3.00
Moses        4.00
dtype: float64

### Normalized Ratings
Normalization (min-max scaling) rescales each numeric variable to the range 0 to 1. One possible formula is given below:

<img src="norm.png" alt="Norm Formula" title="Normalization" />

This normalization formula will be applied to each column, since it only makes sense to normalize by movie and not by person.
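
For reference, the min-max formula can be written out as below. This is reconstructed from the code in this section rather than from the image, and the symbols $x'$, $x_{\min}$ and $x_{\max}$ are notation chosen here: $x$ is one rating, and $x_{\min}$ and $x_{\max}$ are the lowest and highest ratings in the same movie's column.

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\qquad \text{e.g. for Mulan (ratings from 1 to 3):} \qquad
\frac{2 - 1}{3 - 1} = 0.5
```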

In [4]:
# copy original DataFrame
df_norm = df.copy()

# apply normalization formula as a function
df_norm = df_norm.apply(lambda x: (x-x.min()) / (x.max() - x.min()))

# view full DataFrame since dataset is small 
df_norm

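| | Mulan | The Invisible Man | Contagion | Avengers: Endgame | The Delta Force | An American Crime |
|---|---|---|---|---|---|---|
| Andrew | 0.5 | 1.0 | 0.0 | NaN | NaN | 0.0 |
| Joel | 0.5 | 0.0 | 0.0 | 1.0 | NaN | NaN |
| Justin | NaN | 0.0 | NaN | 1.0 | NaN | NaN |
| Kimberley | 0.0 | 1.0 | 1.0 | 1.0 | NaN | 1.0 |
| Kiaralys | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN |
| Moses | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |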


In [5]:
# view average of each movie's ratings, rounded to 2 decimals
df_norm.mean().round(2)

Mulan                0.40
The Invisible Man    0.67
Contagion            0.50
Avengers: Endgame    0.80
The Delta Force       NaN
An American Crime    0.50
dtype: float64

In [6]:
# view average of each user's ratings, rounded to 2 decimals
df_norm.mean(axis=1).round(2)

Andrew       0.38
Joel         0.38
Justin       0.50
Kimberley    0.80
Kiaralys     0.33
Moses        1.00
dtype: float64

Normalization rescales the data to the range 0 to 1, regardless of the initial values. This gives us a manageable range that we can use to present our data. Normalizing keeps even outlying ratings within fixed bounds and provides a more "in-line" set of numbers. However, the normalized data can be misleading if every value in a DataFrame column or pandas Series is the same, because the result is then `np.nan`. For example, in my dataset, everyone who rated the movie "The Delta Force" gave it a 4. The maximum and minimum are therefore equal, so the normalization formula divides zero by zero, which makes it seem as if there are no results for that movie.
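
The division-by-zero case can be reproduced in isolation. The sketch below uses a hypothetical helper named `min_max` (not part of the notebook) that applies the same min-max expression to a single column; the ratings mirror "The Delta Force" column.

```python
import numpy as np
import pandas as pd

def min_max(col):
    """Min-max normalize one column (hypothetical helper for illustration)."""
    return (col - col.min()) / (col.max() - col.min())

# "The Delta Force" ratings: everyone who saw it gave it a 4 (NaN = not seen)
delta = pd.Series([4.0, np.nan, 4.0, 4.0, np.nan, 4.0])

normalized = min_max(delta)

# max - min is 0, so each rated entry becomes 0/0, which pandas reports as NaN
print(normalized.isna().all())  # True

# the column still has real ratings; only the normalized version is empty
print(delta.count(), normalized.count())  # 4 0
```

Comparing `col.max()` with `col.min()` before normalizing is one way to spot such constant columns in advance.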

### Standardized Ratings
Standardization transforms the data to have zero mean and unit variance, for example using the equation below:
<img src="stand.png" alt="Stand Formula" title="Standardization" />

This standardization formula will be applied to each column.
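
As a small check of what standardization does, the sketch below applies the z-score expression to the Mulan ratings alone and confirms that the result has mean 0 and sample standard deviation 1. One detail worth noting: pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to `ddof=0`, so the two would give slightly different z-scores. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Mulan ratings from the survey table (NaN = not seen)
mulan = pd.Series([2.0, 2.0, np.nan, 1.0, 1.0, 3.0])

# z-score using pandas defaults: NaN is skipped and std() uses ddof=1
z = (mulan - mulan.mean()) / mulan.std()

print(round(z.mean(), 6))  # mean is 0 up to floating-point error
print(round(z.std(), 6))   # sample standard deviation is 1

# NumPy's population std (ddof=0) is smaller, so it would give larger z-scores
pop_std = np.std(mulan.dropna().to_numpy())  # ddof=0 by default
print(mulan.std() > pop_std)  # True
```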

In [7]:
# copy original DataFrame
df_stand = df.copy()

# apply standardization formula as a function
df_stand = df_stand.apply(lambda x: (x-x.mean()) / (x.std()))

# view full DataFrame since dataset is small 
df_stand

| | Mulan | The Invisible Man | Contagion | Avengers: Endgame | The Delta Force | An American Crime |
|---|---|---|---|---|---|---|
| Andrew | 0.239046 | 0.645497 | -0.866025 | NaN | NaN | -0.707107 |
| Joel | 0.239046 | -1.290994 | -0.866025 | 0.447214 | NaN | NaN |
| Justin | NaN | -1.290994 | NaN | 0.447214 | NaN | NaN |
| Kimberley | -0.956183 | 0.645497 | 0.866025 | 0.447214 | NaN | 0.707107 |
| Kiaralys | -0.956183 | 0.645497 | NaN | -1.788854 | NaN | NaN |
| Moses | 1.434274 | 0.645497 | 0.866025 | 0.447214 | NaN | NaN |


In [8]:
# view average of each movie's ratings, rounded to 2 decimals
df_stand.mean().round(2)

Mulan               -0.0
The Invisible Man    0.0
Contagion            0.0
Avengers: Endgame    0.0
The Delta Force      NaN
An American Crime    0.0
dtype: float64

In [9]:
# view average of each user's ratings, rounded to 2 decimals
df_stand.mean(axis=1).round(2)

Andrew      -0.17
Joel        -0.37
Justin      -0.42
Kimberley    0.34
Kiaralys    -0.70
Moses        0.85
dtype: float64