# IS 362 - Week 7 Assignment

### Johnny Zgombic

Choose six recent popular movies.  Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movies that they have seen on a scale of 1 to 5.  There should be at least one movie that not everyone has seen!

Take the results (observations) and store them somewhere (like a SQL database, or a .CSV file).  Load the information into a pandas dataframe.  Your solution should include Python and pandas code that accomplishes the following:

1. Load the ratings by user information that you collected into a pandas dataframe.

2. Show the average ratings for each user and each movie.

3. Create a new pandas dataframe, with normalized ratings for each user.  Again, show the average ratings for each user and each movie.

4. Provide a text-based conclusion: explain what might be advantages and disadvantages of using normalized ratings instead of the actual ratings.

The first step in our program is to import **pandas** and **numpy**.

In [10]:
import numpy as np
import pandas as pd

We will now read the CSV file into a pandas dataframe using **pd.read_csv**, with the first column (the movie title) as the index.

In [11]:
ratings = pd.read_csv('ratings.csv', index_col = 0)
ratings

| Movie | Konata Ratings | Jenny Rating | James Rating | Gerson Rating | Hyla Rating | Jermaine Ratings | Aaron Rating | Aung Rating |
|---|---|---|---|---|---|---|---|---|
| Spider-Man: Far from Home | 4 | NaN | 5.0 | 3 | NaN | 3.0 | 4 | 3 |
| Avengers: Endgame | 5 | 5.0 | 5.0 | 4 | 3.0 | NaN | 5 | 4 |
| Captain Marvel | 2 | NaN | 2.0 | 1 | 3.0 | 3.0 | 4 | 2 |
| Spider-Man: Into the Spider-Verse | 5 | NaN | 3.0 | 5 | NaN | 4.0 | 5 | 5 |
| Venom | 3 | NaN | NaN | 4 | NaN | 4.0 | 4 | 3 |
| Dark Phoenix | 4 | 5.0 | NaN | 4 | NaN | 5.0 | 3 | 3 |
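
To see what `index_col=0` does without needing the `ratings.csv` file itself, here is a minimal sketch that reads a small CSV from an in-memory string. The names `sample_csv` and `sample` are hypothetical, and the rows are a three-movie, three-user subset of the ratings shown above.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for ratings.csv (subset of the ratings above).
sample_csv = """Movie,Konata Ratings,Jenny Rating,James Rating
Spider-Man: Far from Home,4,,5
Avengers: Endgame,5,5,5
Venom,3,,
"""

# index_col=0 makes the first column (Movie) the row index.
# Empty fields are read as NaN, which turns their columns into floats.
sample = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(sample)
print(sample.index.name)          # name of the index column
print(sample.isna().sum().sum())  # total number of missing ratings
```

Because empty fields become *NaN*, any column with a missing rating is stored as floating point, which is why some columns above print `5.0` while others print `4`.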


We now take the **mean** of the ratings in each row, which gives one average per movie.

Calling **mean()** with **axis=1** averages across the columns (the users) for every row (movie). **skipna=True**, which is also the default, ignores the *NaN* values, so each average reflects only the ratings that were actually given.

In [12]:
movie_average = ratings.mean(axis = 1, skipna = True)
movie_average

Movie
Spider-Man: Far from Home            3.666667
Avengers: Endgame                    4.428571
Captain Marvel                       2.428571
Spider-Man: Into the Spider-Verse    4.500000
Venom                                3.600000
Dark Phoenix                         4.000000
dtype: float64
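
As a small illustration of what `skipna` changes, the sketch below computes row means for two movies and three users taken from the ratings above. The names `small`, `per_movie`, and `per_movie_strict` are made up for this example.

```python
import numpy as np
import pandas as pd

# Two movies and three users copied from the ratings table.
small = pd.DataFrame(
    {
        "Konata Ratings": [4, 5],
        "Jenny Rating": [np.nan, 5.0],
        "James Rating": [5.0, 5.0],
    },
    index=pd.Index(["Spider-Man: Far from Home", "Avengers: Endgame"], name="Movie"),
)

# skipna=True is the default: missing ratings are left out of the average.
per_movie = small.mean(axis=1)

# skipna=False: any missing rating makes the whole row average NaN.
per_movie_strict = small.mean(axis=1, skipna=False)

print(per_movie)
print(per_movie_strict)
```

With `skipna=False`, a movie that even one person has not seen would have no average at all, which is why the default is the better fit for this data.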

We now take the **mean** of the ratings in each column, which gives one average per user.

Calling **mean()** with **axis=0** averages down the rows (the movies) for every column (user). **skipna=True**, which is also the default, ignores the *NaN* values, so each average reflects only the ratings that were actually given.

In [13]:
user_average = ratings.mean(axis = 0, skipna = True)
user_average

Konata Ratings      3.833333
Jenny Rating        5.000000
James Rating        3.750000
Gerson Rating       3.500000
Hyla Rating         3.000000
Jermaine Ratings    3.800000
Aaron Rating        4.166667
Aung Rating         3.333333
dtype: float64
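
A per-user average is easier to interpret next to the number of movies it is based on. The sketch below, which is an addition and not part of the original analysis, pairs `mean()` with `count()` for three of the users above; `subset` and `summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

movies = [
    "Spider-Man: Far from Home",
    "Avengers: Endgame",
    "Captain Marvel",
    "Spider-Man: Into the Spider-Verse",
    "Venom",
    "Dark Phoenix",
]

# Three user columns copied from the ratings table.
subset = pd.DataFrame(
    {
        "Konata Ratings": [4, 5, 2, 5, 3, 4],
        "Jenny Rating": [np.nan, 5.0, np.nan, np.nan, np.nan, 5.0],
        "Hyla Rating": [np.nan, 3.0, 3.0, np.nan, np.nan, np.nan],
    },
    index=pd.Index(movies, name="Movie"),
)

# count() returns the number of non-NaN ratings in each column.
summary = pd.DataFrame(
    {"mean": subset.mean(axis=0), "count": subset.count(axis=0)}
)
print(summary)
```

Here Jenny's average of 5.0 rests on only two ratings, while Konata's 3.83 rests on all six movies.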

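This step does three things.

First, **fillna()** replaces every *NaN* value with *0* so that each cell has a number for the subtraction used in the normalization.

Next, the min-max normalization formula rescales each user's column to the range 0 to 1.

Finally, **replace()** turns every *0* in the result back into *NaN* to mark the missing ratings. Note two side effects of this approach: for a user with missing ratings, the filled-in *0* becomes the column minimum, and for a user with no missing ratings, the lowest real rating also normalizes to *0* and is therefore shown as *NaN* (for example, Konata's rating of 2 for Captain Marvel).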
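
For reference, the formula applied in the next cell is min-max normalization, computed separately for each user column. The notation $x_{m,u}$ for user $u$'s rating of movie $m$ is introduced here only for illustration:

$$
x'_{m,u} = \frac{x_{m,u} - \min_{m} x_{m,u}}{\max_{m} x_{m,u} - \min_{m} x_{m,u}}
$$

As a worked example, Konata's ratings run from a minimum of 2 to a maximum of 5, so a rating of 4 becomes $(4 - 2) / (5 - 2) \approx 0.667$. James has missing ratings, so after filling with 0 his minimum is 0 and his maximum is 5, and a rating of 2 becomes $(2 - 0) / (5 - 0) = 0.4$.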

In [14]:
ratings.fillna(value=0, inplace=True)
normal = (ratings - ratings.min()) / (ratings.max() - ratings.min())
normal2 = normal.replace(0, np.nan)
normal2

| Movie | Konata Ratings | Jenny Rating | James Rating | Gerson Rating | Hyla Rating | Jermaine Ratings | Aaron Rating | Aung Rating |
|---|---|---|---|---|---|---|---|---|
| Spider-Man: Far from Home | 0.666667 | NaN | 1.0 | 0.5 | NaN | 0.6 | 0.5 | 0.333333 |
| Avengers: Endgame | 1.0 | 1.0 | 1.0 | 0.75 | 1.0 | NaN | 1.0 | 0.666667 |
| Captain Marvel | NaN | NaN | 0.4 | NaN | 1.0 | 0.6 | 0.5 | NaN |
| Spider-Man: Into the Spider-Verse | 1.0 | NaN | 0.6 | 1.0 | NaN | 0.8 | 1.0 | 1.0 |
| Venom | 0.333333 | NaN | NaN | 0.75 | NaN | 0.8 | 0.5 | 0.333333 |
| Dark Phoenix | 0.666667 | 1.0 | NaN | 0.75 | NaN | 1.0 | NaN | 0.333333 |
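
One possible alternative, sketched here as a suggestion and not as the method used above, is to skip the fill-and-replace steps entirely. pandas `min()` and `max()` ignore *NaN* by default, and arithmetic on *NaN* yields *NaN*, so missing ratings stay missing and a user's lowest real rating stays at 0. The names `subset` and `normalized` are hypothetical; the two columns are copied from the ratings table.

```python
import numpy as np
import pandas as pd

movies = [
    "Spider-Man: Far from Home",
    "Avengers: Endgame",
    "Captain Marvel",
    "Spider-Man: Into the Spider-Verse",
    "Venom",
    "Dark Phoenix",
]

subset = pd.DataFrame(
    {
        "Konata Ratings": [4, 5, 2, 5, 3, 4],
        "James Rating": [5.0, 5.0, 2.0, 3.0, np.nan, np.nan],
    },
    index=pd.Index(movies, name="Movie"),
)

# Column-wise min and max, computed over the ratings that exist (NaN skipped).
col_min = subset.min()
col_max = subset.max()

# Min-max normalization per user; NaN cells stay NaN.
normalized = (subset - col_min) / (col_max - col_min)
print(normalized)
```

One caveat of this variant: a user whose ratings are all identical has a maximum equal to the minimum, so the division is 0/0 and every one of that user's ratings becomes *NaN*.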


We now take the **mean** of the *normalized* ratings in each row, which gives one normalized average per movie.

Calling **mean()** with **axis=1** averages across the columns (the users) for every row (movie). **skipna=True**, which is also the default, ignores the *NaN* values, so each average reflects only the normalized ratings that are present.

In [15]:
normal_movie_average = normal2.mean(axis = 1, skipna = True)
normal_movie_average

Movie
Spider-Man: Far from Home            0.600000
Avengers: Endgame                    0.916667
Captain Marvel                       0.625000
Spider-Man: Into the Spider-Verse    0.900000
Venom                                0.543333
Dark Phoenix                         0.750000
dtype: float64

We now take the **mean** of the *normalized* ratings in each column, which gives one normalized average per user.

Calling **mean()** with **axis=0** averages down the rows (the movies) for every column (user). **skipna=True**, which is also the default, ignores the *NaN* values, so each average reflects only the normalized ratings that are present.

In [16]:
normal_user_average = normal2.mean(axis = 0, skipna = True)
normal_user_average

Konata Ratings      0.733333
Jenny Rating        1.000000
James Rating        0.750000
Gerson Rating       0.750000
Hyla Rating         1.000000
Jermaine Ratings    0.760000
Aaron Rating        0.700000
Aung Rating         0.533333
dtype: float64

### Conclusion

By its definition, the goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.

**What exactly does this mean?** 

When comparing values, a range of 0 to 1 is much easier to interpret than a range of 1 to 10,000, especially when there are thousands of rows of data to deal with. If you know that your data is normalized, you can look at a single value and immediately know where it sits relative to the entire dataset: a value of 0.00001 is very low and a value of 0.99 is very high. If the data is not normalized, you first have to find the minimum and maximum before any single value can be interpreted.
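
To make the point about scale concrete, here is a minimal sketch that rescales a handful of made-up values spanning 1 to 10,000 onto the 0 to 1 range. The numbers in `raw` are invented purely for illustration.

```python
import pandas as pd

# Hypothetical raw values on a wide 1-10000 scale.
raw = pd.Series([1, 250, 5000, 9900, 10000], name="raw")

# Min-max normalization: the smallest value maps to 0, the largest to 1.
scaled = (raw - raw.min()) / (raw.max() - raw.min())

print(pd.DataFrame({"raw": raw, "scaled": scaled.round(4)}))
```

After scaling, a value near 0 is immediately recognizable as close to the minimum and a value near 1 as close to the maximum, without looking up the original range.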