<h1 align="center"> Data Analysis with Python</h1>

### About Dataset

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

### Data Dictionary

user_id – identifiers of the 4848 customers who provided ratings<br>
Movie1 to Movie206 – ratings for 206 movies, given by the 4848 distinct users


### Data Considerations

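Not every user has watched every movie, so most user/movie combinations have no rating. These missing values are represented by NA.<br>
The dataset description gives a rating scale of -1 to 10, where -1 is the lowest rating and 10 is the best. The columns displayed in the summary statistics of this analysis, however, only contain ratings between 1 and 5.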
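
The stated scale and the share of missing values can be checked directly against the data. The following is a minimal sketch on a made-up toy frame (`toy` and its values are hypothetical, not taken from the real file); the same three expressions should work on the real dataframe once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values, not the real data)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, 4.0],
})

# Keep only the rating columns
ratings = toy.drop(columns="user_id")

# Smallest and largest rating actually present (NaN is skipped by default)
observed_min = ratings.min().min()
observed_max = ratings.max().max()

# Fraction of user/movie cells that carry no rating
missing_share = ratings.isna().to_numpy().mean()

print(observed_min, observed_max, round(missing_share, 2))
```

Comparing `observed_min` and `observed_max` with the documented scale is a quick way to spot a mismatch between the data dictionary and the file.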


## Analysis tasks

    1. Which movies have the most views/ratings?
    2. What is the average rating for each movie? Identify the top 5 movies with the maximum ratings.
    3. Identify the top 5 movies with the smallest audience.


## Recommendation model building

Many movies have not been watched, and therefore not rated, by a given user. Netflix would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts the missing ratings for each user.

    1. Divide the data into training and test data
    2. Build a recommendation model on training data
    3. Make predictions on the test data
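
The three steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the model built in this analysis: the toy values are made up, the split ratio and `random_state` are arbitrary, and the "model" is a plain per-movie mean baseline chosen because it needs nothing beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same wide layout (hypothetical values)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, 4.0, np.nan, 5.0],
    "Movie2": [np.nan, 2.0, 3.0, 1.0],
})

# Wide -> long: one row per observed (user, movie, rating) triple
ratings_long = toy.melt(
    id_vars="user_id", var_name="movie", value_name="rating"
).dropna()

# 1. Divide the observed ratings into training and test data
test = ratings_long.sample(frac=0.25, random_state=0)
train = ratings_long.drop(test.index)

# 2. Baseline "model": mean training rating per movie,
#    with the global training mean as a fallback for unseen movies
movie_mean = train.groupby("movie")["rating"].mean()
global_mean = train["rating"].mean()

# 3. Predict the held-out ratings and measure the error (RMSE)
pred = test["movie"].map(movie_mean).fillna(global_mean)
rmse = float(np.sqrt(((pred - test["rating"]) ** 2).mean()))

print(len(train), len(test), rmse)
```

A baseline like this gives a reference error that any more elaborate recommender should beat on the same split.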


# Start Analysis

#### Import libraries required

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
lm=LinearRegression()

### Importing data and assigning it to a dataframe called data

In [6]:
data = pd.read_csv("Amazon - Movies and TV Ratings.csv")

### Get an overview of the data by checking the top rows

In [7]:
data.head()

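| | user_id | Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | Movie7 | Movie8 | Movie9 | ... | Movie197 | Movie198 | Movie199 | Movie200 | Movie201 | Movie202 | Movie203 | Movie204 | Movie205 | Movie206 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | 5.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | AH3QC2PC1VTGP | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | A3LKP6WPMP9UKX | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | AVIY68KEPQ5ZD | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | A1CV1WROP5KTTW | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |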


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


In [9]:
data.describe()

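| | Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | Movie7 | Movie8 | Movie9 | Movie10 | ... | Movie197 | Movie198 | Movie199 | Movie200 | Movie201 | Movie202 | Movie203 | Movie204 | Movie205 | Movie206 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.0 | 1.0 | 1.0 | 2.0 | 29.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 5.0 | 2.0 | 1.0 | 8.0 | 3.0 | 6.0 | 1.0 | 8.0 | 35.0 | 13.0 |
| mean | 5.0 | 5.0 | 2.0 | 5.0 | 4.103448 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 3.8 | 5.0 | 5.0 | 4.625 | 4.333333 | 4.333333 | 3.0 | 4.375 | 4.628571 | 4.923077 |
| std | NaN | NaN | NaN | 0.0 | 1.496301 | NaN | NaN | NaN | NaN | NaN | ... | 1.643168 | 0.0 | NaN | 0.517549 | 1.154701 | 1.632993 | NaN | 1.407886 | 0.910259 | 0.27735 |
| min | 5.0 | 5.0 | 2.0 | 5.0 | 1.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 1.0 | 5.0 | 5.0 | 4.0 | 3.0 | 1.0 | 3.0 | 1.0 | 1.0 | 4.0 |
| 25% | 5.0 | 5.0 | 2.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 4.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 3.0 | 4.75 | 5.0 | 5.0 |
| 50% | 5.0 | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 5.0 | 5.0 |
| 75% | 5.0 | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 5.0 | 5.0 |
| max | 5.0 | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 5.0 | 5.0 |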


### As a first step, count the NaN values for each movie

In [21]:
col_list = list(range(1,207))
row_list = list(range(0,4848))

In [72]:
movies=[]
null_values=[]

In [73]:
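col = "Movie"
for x in col_list:
    # Count the missing ratings (NaN) in this movie's column
    null_values.append(data[col + str(x)].isna().sum())
    movies.append(col + str(x))
# One row per movie: its name and its number of missing ratings
dictionary_null = {"Movies": movies, "Null_values": null_values}
null_values_in_movie = pd.DataFrame(dictionary_null)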

In [74]:
# Reviews per movie = total number of users (4848 rows) minus missing ratings
null_values_in_movie["Reviews"] = len(data) - null_values_in_movie["Null_values"]


In [75]:
null_values_in_movie.head(20)

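| | Movies | Null_values | Reviews |
|---|---|---|---|
| 0 | Movie1 | 4847 | 1 |
| 1 | Movie2 | 4847 | 1 |
| 2 | Movie3 | 4847 | 1 |
| 3 | Movie4 | 4846 | 2 |
| 4 | Movie5 | 4819 | 29 |
| 5 | Movie6 | 4847 | 1 |
| 6 | Movie7 | 4847 | 1 |
| 7 | Movie8 | 4847 | 1 |
| 8 | Movie9 | 4847 | 1 |
| 9 | Movie10 | 4847 | 1 |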
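
The same per-movie counts can also be obtained without an explicit loop. The sketch below uses a made-up toy frame (`toy` is hypothetical) and relies on `DataFrame.isna().sum()` for the missing values and `DataFrame.count()` for the non-missing ones; it is an alternative formulation, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, 4.0],
})

# Keep only the rating columns
ratings = toy.drop(columns="user_id")

# isna().sum() counts missing ratings per column,
# count() counts the ratings that are present
summary = pd.DataFrame({
    "Movies": ratings.columns,
    "Null_values": ratings.isna().sum().to_numpy(),
    "Reviews": ratings.count().to_numpy(),
})

print(summary)
```

Because `count()` ignores NaN, the number of users does not need to be hard-coded.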


### This dataframe holds the number of null values for each movie and, equivalently, the number of reviews each movie received

In [76]:
null_values_in_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 3 columns):
Movies         206 non-null object
Null_values    206 non-null int64
Reviews        206 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.9+ KB


In [77]:
null_values_in_movie.describe()

| | Null_values | Reviews |
|---|---|---|
| count | 206.0 | 206.0 |
| mean | 4823.728155 | 24.271845 |
| std | 168.937841 | 168.937841 |
| min | 2535.0 | 1.0 |
| 25% | 4843.0 | 1.0 |
| 50% | 4846.0 | 2.0 |
| 75% | 4847.0 | 5.0 |
| max | 4847.0 | 2313.0 |


### Outcomes from the summary above
    1. Every movie has at least one review
    2. At least 25% of the movies have only one review (25th percentile = 1)
    3. Half of the movies have two reviews or fewer (median = 2)
    4. 75% of the movies have five reviews or fewer (75th percentile = 5)
    5. The most-reviewed movie has 2313 reviews
    6. The distribution is strongly right-skewed: most movies have very few reviews, while a handful have very many and act as outliers (mean of about 24 against a median of 2)
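
The skew described above can be quantified with a few lines of pandas. The sketch below runs on made-up review counts (the `reviews` values are hypothetical) and uses the common box-plot rule of thumb, Q3 + 1.5 × IQR, as an assumed outlier threshold; other thresholds are equally defensible.

```python
import pandas as pd

# Hypothetical review counts for seven movies (not the real data)
reviews = pd.Series([1, 1, 1, 2, 2, 5, 300], name="Reviews")

# Share of movies with exactly one review
share_single = (reviews == 1).mean()

# A mean far above the median signals a long right tail
mean_reviews = reviews.mean()
median_reviews = reviews.median()

# Flag movies beyond Q3 + 1.5 * IQR (assumed rule of thumb)
q1, q3 = reviews.quantile([0.25, 0.75])
outliers = reviews[reviews > q3 + 1.5 * (q3 - q1)]

print(share_single, mean_reviews, median_reviews, outliers.tolist())
```

Applied to the `Reviews` column of the real summary, the same comparison of mean and median reproduces the gap seen in the `describe()` output.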

In [78]:
question1_data=null_values_in_movie.sort_values(by="Reviews", ascending=False)
question1_data.reset_index(drop=True,inplace=True)

In [80]:
question1_data

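| | Movies | Null_values | Reviews |
|---|---|---|---|
| 0 | Movie127 | 2535 | 2313 |
| 1 | Movie140 | 4270 | 578 |
| 2 | Movie16 | 4528 | 320 |
| 3 | Movie103 | 4576 | 272 |
| 4 | Movie29 | 4605 | 243 |
| 5 | Movie91 | 4720 | 128 |
| 6 | Movie92 | 4747 | 101 |
| 7 | Movie89 | 4765 | 83 |
| 8 | Movie158 | 4782 | 66 |
| 9 | Movie108 | 4794 | 54 |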


    Question 1: Which movies have the most views/ratings?

In [81]:
question1_data.head(3)

| | Movies | Null_values | Reviews |
|---|---|---|---|
| 0 | Movie127 | 2535 | 2313 |
| 1 | Movie140 | 4270 | 578 |
| 2 | Movie16 | 4528 | 320 |


### Prepare the data for the second question

In [83]:
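sum_of_ratings = []
for x in col_list:
    # Series.sum() skips NaN, so only the ratings actually given are added up
    sum_of_ratings.append(data[col + str(x)].sum())
null_values_in_movie["Sum_of_ratings"] = sum_of_ratings
# Average rating = sum of ratings / number of reviews
null_values_in_movie["Average_rating"] = null_values_in_movie["Sum_of_ratings"] / null_values_in_movie["Reviews"]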

In [88]:
question2_data=null_values_in_movie.sort_values(by="Reviews", ascending=False)
question2_data.reset_index(drop=True,inplace=True)

    Question 2: What is the average rating for each movie? Identify the top 5 movies with the maximum ratings.

In [89]:
null_values_in_movie.head(10)

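| | Movies | Null_values | Reviews | Sum_of_ratings | Average_rating |
|---|---|---|---|---|---|
| 0 | Movie1 | 4847 | 1 | 5.0 | 5.0 |
| 1 | Movie2 | 4847 | 1 | 5.0 | 5.0 |
| 2 | Movie3 | 4847 | 1 | 2.0 | 2.0 |
| 3 | Movie4 | 4846 | 2 | 10.0 | 5.0 |
| 4 | Movie5 | 4819 | 29 | 119.0 | 4.103448 |
| 5 | Movie6 | 4847 | 1 | 4.0 | 4.0 |
| 6 | Movie7 | 4847 | 1 | 5.0 | 5.0 |
| 7 | Movie8 | 4847 | 1 | 5.0 | 5.0 |
| 8 | Movie9 | 4847 | 1 | 5.0 | 5.0 |
| 9 | Movie10 | 4847 | 1 | 5.0 | 5.0 |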


In [90]:
question2_data.head()

| | Movies | Null_values | Reviews | Sum_of_ratings | Average_rating |
|---|---|---|---|---|---|
| 0 | Movie127 | 2535 | 2313 | 9511.0 | 4.111976 |
| 1 | Movie140 | 4270 | 578 | 2794.0 | 4.83391 |
| 2 | Movie16 | 4528 | 320 | 1446.0 | 4.51875 |
| 3 | Movie103 | 4576 | 272 | 1241.0 | 4.5625 |
| 4 | Movie29 | 4605 | 243 | 1168.0 | 4.806584 |
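
The table above is ordered by the number of reviews. If "maximum ratings" is instead read as "highest average rating", a ranking by `Average_rating` is needed, and movies with a single 5.0 review would then dominate the top. The sketch below shows one possible way to handle this on made-up rows: it keeps only movies above an assumed minimum number of reviews (`min_reviews` is a hypothetical cut-off, not a value from this analysis) before sorting.

```python
import pandas as pd

# Hypothetical summary rows in the same layout as the table above
summary = pd.DataFrame({
    "Movies": ["MovieA", "MovieB", "MovieC", "MovieD"],
    "Reviews": [1, 40, 120, 60],
    "Average_rating": [5.0, 4.8, 4.1, 4.5],
})

min_reviews = 30  # assumed cut-off, chosen only for illustration

# Drop sparsely reviewed movies, then rank by average rating
top_rated = (
    summary[summary["Reviews"] >= min_reviews]
    .sort_values("Average_rating", ascending=False)
    .head(5)
    .reset_index(drop=True)
)

print(top_rated)
```

The choice of cut-off changes the ranking, so it is worth reporting alongside the result.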


    Question 3: Identify the top 5 movies with the smallest audience.

In [91]:
question1_data.tail()

| | Movies | Null_values | Reviews |
|---|---|---|---|
| 201 | Movie54 | 4847 | 1 |
| 202 | Movie116 | 4847 | 1 |
| 203 | Movie115 | 4847 | 1 |
| 204 | Movie55 | 4847 | 1 |
| 205 | Movie1 | 4847 | 1 |


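The smallest audience is a single review, and many movies share it (the 25th percentile of the review counts is 1). The five movies shown above are therefore just five of the tied movies, listed to answer the question.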
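
Because so many movies are tied at the minimum, it can help to list all of them rather than an arbitrary five. A minimal sketch on made-up rows (the `summary` frame is hypothetical) uses `DataFrame.nsmallest` with `keep="all"`, which returns every row tied with the last selected value:

```python
import pandas as pd

# Hypothetical review counts (not the real data)
summary = pd.DataFrame({
    "Movies": ["MovieA", "MovieB", "MovieC", "MovieD"],
    "Reviews": [1, 1, 7, 1],
})

# keep="all" returns every row tied with the n-th smallest value,
# so the result can contain more than n rows
least = summary.nsmallest(2, "Reviews", keep="all")

# Number of movies sharing the minimum review count
tied_at_min = int((summary["Reviews"] == summary["Reviews"].min()).sum())

print(least["Movies"].tolist(), tied_at_min)
```

Reporting `tied_at_min` next to the five listed movies makes clear how many equally valid answers exist.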