### Project - MovieLens Data Analysis

The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Its MovieLens data is widely used for collaborative filtering and other filtering solutions. Here, however, we use the data to demonstrate our skill in using Python to “play” with data.

### Datasets Information:

- Data.csv: Contains the ratings given by users to particular movies. Columns: user id, movie id, rating, timestamp

- item.csv: Contains information about the movies and their genres. Columns: movie id, movie title, release date, unknown, Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

- user.csv: Contains information about the users who have rated the movies. Columns: user id, age, gender, occupation, zip code

### Objective:

`To implement the techniques learnt as a part of the course.`

### Learning Outcomes:
- Exploratory Data Analysis

- Visualization using Python

- Pandas – groupby, merging 


#### Domain 
`Internet and Entertainment`

**Note that the project will need you to apply the concepts of groupby and merging extensively.**

#### 1. Import the necessary packages - 2.5 marks

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#### 2. Read the 3 datasets into dataframes - 2.5 marks

In [2]:
# reading the Data.csv, item.csv and user.csv into data frames
data = pd.read_csv('Data.csv')
item = pd.read_csv('item.csv')
user = pd.read_csv('user.csv')

#### 3. Apply info, shape, describe, and find the number of missing values in the data - 5 marks
 - Note that you will need to do it for all three datasets separately

In [3]:
print('#Info of data:')
print(data.info())
print('')
print('#Shape of data:')
print(data.shape)
print('')
print('#Describe of data:')
print(data.describe())
print('')
print('#Number of missing values of each column of data:')
print(data.isnull().sum())

#Info of data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user id    100000 non-null  int64
 1   movie id   100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB
None

#Shape of data:
(100000, 4)

#Describe of data:
            user id       movie id         rating     timestamp
count  100000.00000  100000.000000  100000.000000  1.000000e+05
mean      462.48475     425.530130       3.529860  8.835289e+08
std       266.61442     330.798356       1.125674  5.343856e+06
min         1.00000       1.000000       1.000000  8.747247e+08
25%       254.00000     175.000000       3.000000  8.794487e+08
50%       447.00000     322.000000       4.000000  8.828269e+08
75%       682.00000     631.000000       4.000000  8.882600e+08
max       943.00000    1682.000000    

In [4]:
print('#Info of item:')
print(item.info())
print('')
print('#Shape of item:')
print(item.shape)
print('')
print('#Describe of item:')
print(item.describe())
print('')
print('#Number of missing values of each column of item:')
print(item.isnull().sum())

#Info of item:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1681 entries, 0 to 1680
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   movie id      1681 non-null   int64 
 1   movie title   1681 non-null   object
 2   release date  1681 non-null   object
 3   unknown       1681 non-null   int64 
 4   Action        1681 non-null   int64 
 5   Adventure     1681 non-null   int64 
 6   Animation     1681 non-null   int64 
 7   Childrens     1681 non-null   int64 
 8   Comedy        1681 non-null   int64 
 9   Crime         1681 non-null   int64 
 10  Documentary   1681 non-null   int64 
 11  Drama         1681 non-null   int64 
 12  Fantasy       1681 non-null   int64 
 13  Film-Noir     1681 non-null   int64 
 14  Horror        1681 non-null   int64 
 15  Musical       1681 non-null   int64 
 16  Mystery       1681 non-null   int64 
 17  Romance       1681 non-null   int64 
 18  Sci-Fi        1681 non-null   int

In [5]:
print('#Info of user:')
print(user.info())
print('')
print('#Shape of user:')
print(user.shape)
print('')
print('#Describe of user:')
print(user.describe())
print('')
print('#Number of missing values of each column of user:')
print(user.isnull().sum())

#Info of user:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB
None

#Shape of user:
(943, 5)

#Describe of user:
          user id         age
count  943.000000  943.000000
mean   472.000000   34.051962
std    272.364951   12.192740
min      1.000000    7.000000
25%    236.500000   25.000000
50%    472.000000   31.000000
75%    707.500000   43.000000
max    943.000000   73.000000

#Number of missing values of each column of user:
user id       0
age           0
gender        0
occupation    0
zip code      0
dtype: int64
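
The three cells above repeat the same four calls for each dataset. As a minimal sketch, the repetition could be folded into one helper. The function name `summarize` and the toy frame below are hypothetical and are not part of the project files. Note that `info()` prints its report itself and returns `None`, which is why a stray `None` appears in the outputs above when it is wrapped in `print`.

```python
import pandas as pd


def summarize(df: pd.DataFrame, name: str) -> pd.Series:
    """Print info, shape and describe for df; return missing values per column."""
    print(f'#Info of {name}:')
    df.info()  # prints directly, so it is not wrapped in print()
    print(f'\n#Shape of {name}:')
    print(df.shape)
    print(f'\n#Describe of {name}:')
    print(df.describe())
    missing = df.isnull().sum()
    print(f'\n#Number of missing values of each column of {name}:')
    print(missing)
    return missing


# Toy stand-in for user.csv (made-up values, only to show the call)
toy_user = pd.DataFrame({
    'user id': [1, 2, 3],
    'age': [24, None, 31],
    'gender': ['M', 'F', 'M'],
})
missing = summarize(toy_user, 'toy_user')
```

With the real data, the same helper could be called once per frame, for example `summarize(data, 'data')`.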


#### 4. Find the number of movies per genre using the item data - 2.5 marks

In [None]:
print(item.iloc[:, 3:].sum(axis=0))
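
Because every genre column is a 0/1 flag, summing a column counts the movies in that genre. The sketch below shows the same idea on a made-up three-movie table. The frame `toy_item` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for item.csv with three genre flags (made-up values)
toy_item = pd.DataFrame({
    'movie id': [1, 2, 3],
    'movie title': ['A', 'B', 'C'],
    'release date': ['01-Jan-1995', '01-Jan-1995', '01-Jan-1996'],
    'Action': [1, 0, 1],
    'Comedy': [0, 1, 1],
    'Drama': [0, 0, 1],
})

# Everything after the first three columns is a genre flag
genre_cols = toy_item.columns[3:]
# Column-wise sum of 0/1 flags = number of movies per genre
movies_per_genre = toy_item[genre_cols].sum().sort_values(ascending=False)
print(movies_per_genre)
```

A movie with several genres is counted once in each of them, so the per-genre counts can add up to more than the number of movies.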

#### 5. Find the movies that have more than one genre - 5 marks

In [None]:
# Sum the 0/1 genre flags per movie to get the number of genres for each title
genre_count = item.set_index('movie title').drop(['movie id', 'release date'], axis=1).sum(axis=1)
# Keep only the movies that belong to more than one genre
item_temp = genre_count[genre_count > 1]
print(item_temp)
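
The row-wise counterpart of the previous question can be checked on a small example: summing the flags along `axis=1` gives the number of genres per movie. The titles and flags below are made up for illustration.

```python
import pandas as pd

# Toy genre flags indexed by title (made-up values)
toy_flags = pd.DataFrame(
    {'Action': [1, 0, 1], 'Comedy': [0, 1, 1], 'Drama': [0, 0, 1]},
    index=pd.Index(['A', 'B', 'C'], name='movie title'),
)

# Row-wise sum of 0/1 flags = number of genres per movie
genres_per_movie = toy_flags.sum(axis=1)
# Boolean mask keeps only the movies with more than one genre
multi_genre = genres_per_movie[genres_per_movie > 1]
print(multi_genre)
```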

#### 6. Drop the movie where the genre is unknown - 2.5 marks

In [None]:
item_unknown_genre = item[item['unknown'] == 1]
print("Movies with 'unknown' genre:")
print(item_unknown_genre['movie title'])
# item_known_genre holds the movies that remain after filtering out the unknown genre
item_known_genre = item[item['unknown'] != 1]
print(item_known_genre.shape)
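
The cell above filters with a boolean mask and keeps the result in a new frame. An equivalent sketch with `DataFrame.drop`, which removes the rows by index label, is shown below on made-up data. The frame `toy_item` is hypothetical.

```python
import pandas as pd

# Toy stand-in for item.csv (made-up values)
toy_item = pd.DataFrame({
    'movie title': ['A', 'B', 'C'],
    'unknown': [0, 1, 0],
    'Drama': [1, 0, 1],
})

# Index labels of the rows flagged as unknown genre
unknown_idx = toy_item.index[toy_item['unknown'] == 1]
# Drop those rows and renumber the remaining ones
toy_item = toy_item.drop(index=unknown_idx).reset_index(drop=True)
print(toy_item)
```

Either form leaves the original frame untouched unless the result is assigned back to the same name.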

### 7. Univariate plots of columns: 'rating', 'Age', 'release year', 'Gender' and 'Occupation' - 10 marks

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(x=data['rating'], discrete=True);

In [None]:
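# histplot with kde=True replaces the deprecated distplot (histogram plus density curve)
sns.histplot(x=user['age'], kde=True);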

In [None]:
# Pass the column by keyword; recent seaborn releases treat the first positional argument as `data`
sns.countplot(data=user, x='gender');

In [None]:
plt.figure(figsize=(30,20))
sns.countplot(data=user, x='occupation')
# Rotate the labels after plotting so the rotation applies to the final ticks
plt.xticks(rotation=45);

In [None]:
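# Keep only the release year (this overwrites the 'release date' column in place)
item['release date'] = pd.DatetimeIndex(item['release date']).year
# Years parsed into the future are shifted back by 100 years
item.loc[(item['release date'] > 2021),'release date']=item['release date']-100
plt.figure(figsize=(30,20))
sns.countplot(x=item['release date'])
# Rotate the labels after plotting so the rotation applies to the final ticks
plt.xticks(rotation=45);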
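
The 100-year correction in the cell above is only needed if the parser puts a year in the wrong century. The sketch below assumes, hypothetically, that the dates carry two-digit years such as `01-Jan-95`. The `%y` directive maps 00-68 to 2000-2068 and 69-99 to 1969-1999, so a 1930 release would come out as 2030. The sample dates and the cutoff year are assumptions, not values from `item.csv`.

```python
import pandas as pd

# Hypothetical date strings with two-digit years (assumed format)
dates = pd.Series(['01-Jan-95', '15-Mar-30', '24-Feb-68'])

# %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
years = pd.to_datetime(dates, format='%d-%b-%y').dt.year

# Assumed cutoff: no release in the data can be later than this year
latest_plausible_year = 1998
# Keep plausible years, shift the others back by one century
corrected = years.where(years <= latest_plausible_year, years - 100)
print(corrected.tolist())
```

Writing the result to a separate column, rather than overwriting `release date`, would keep the cell safe to run more than once.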

### 8. Visualize how popularity of genres has changed over the years - 10 marks

Note that you need to use the number of releases in a year as a measure of the popularity of a genre.

Hint 1: You need to arrive at a data frame where the release year is the index and the genres are the column names (each cell shows the number of releases in one year in one genre), or vice versa.
Once that is achieved, you can either use multiple bivariate plots or use a heatmap to visualise all the changes over the years in one go.

Hint 2: Use groupby on the relevant column and apply sum() to find the number of releases per year and genre.
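
The two hints can be sketched on a toy table as follows. The column name `release year` and all values are made up. The real frame would first need its release year extracted and its non-genre columns dropped.

```python
import pandas as pd

# Toy movies: one row per movie, 0/1 flag per genre (made-up values)
toy_item = pd.DataFrame({
    'release year': [1995, 1995, 1996, 1996, 1996],
    'Action': [1, 0, 1, 1, 0],
    'Drama': [0, 1, 1, 0, 1],
})

# Release year becomes the index, genres stay as columns,
# and each cell holds the number of releases for that year and genre
releases = toy_item.groupby('release year').sum()
print(releases)
```

A frame of this shape can be passed directly to a heatmap, or a single column can be plotted against the index for a per-genre trend.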

In [None]:
item_temp = item.copy()
# 'release date' already holds the release year after the cell for question 7
# Drop the non-genre columns so that only the 0/1 genre flags are summed
item_temp = item_temp.drop(['movie id', 'movie title'], axis=1)
# One row per release year, one column per genre, each cell = number of releases
item_new = item_temp.groupby('release date').sum()
plt.figure(figsize=(30,20))
# Let the colour scale follow the counts instead of clipping it at 1
sns.heatmap(item_new, annot=True, fmt='d', cmap='plasma')
# jointplot creates its own figure, so no plt.figure() call is needed
sns.jointplot(x=item_new.index.to_numpy(), y=item_new['Action'].to_numpy())


### 9. Find the top 25 movies according to average ratings such that each movie has number of ratings more than 100 - 10 marks

Hint : 

1. First find the movies that have more than 100 ratings(use merge, groupby and count). Extract the movie id in a list.
2. Find the average rating of all the movies and sort them in descending order. You will have to use the .merge() function to arrive at a data set from which you can get the ids and the average ratings.
3. Use isin(list obtained from 1) to filter out the movies which have more than 100 ratings.

Note: This question will need you to research about groupby and apply your findings. You can find more on groupby on https://realpython.com/pandas-groupby/.
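
The three hinted steps can be sketched on toy data as below. The frames, the titles and the threshold `min_ratings = 1` are assumptions chosen so that the tiny example has something to filter. The task itself uses a threshold of 100.

```python
import pandas as pd

# Toy ratings and movie titles (made-up values)
ratings = pd.DataFrame({
    'user id': [1, 2, 3, 1, 2, 1],
    'movie id': [10, 10, 10, 20, 20, 30],
    'rating': [5, 4, 3, 2, 4, 5],
})
movies = pd.DataFrame({'movie id': [10, 20, 30], 'movie title': ['A', 'B', 'C']})

min_ratings = 1  # the task uses 100; 1 suits the toy data

# Step 1: ids of movies with more than min_ratings ratings
counts = ratings.groupby('movie id')['rating'].count()
popular_ids = counts[counts > min_ratings].index.tolist()

# Step 2: average rating per movie, titles attached via merge, sorted descending
avg = (ratings.groupby('movie id', as_index=False)['rating'].mean()
       .merge(movies, on='movie id')
       .sort_values('rating', ascending=False))

# Step 3: keep only the sufficiently rated movies and take the top 25
top = avg[avg['movie id'].isin(popular_ids)].head(25)
print(top)
```

Movie C has the highest average in the toy data but only one rating, so the filter in step 3 removes it from the ranking.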

In [None]:
# Join ratings with movie details and then with user details.
# movie_review_data is reused in question 10.
data_item = pd.merge(data, item, how='inner', on='movie id')
movie_review_data = pd.merge(data_item, user, how='inner', on='user id')
# Keep only movies with strictly more than 100 ratings
rating_counts = movie_review_data.groupby('movie id')['rating'].transform('count')
movie_review_data_new = movie_review_data[rating_counts.gt(100)]
movie_review_data_new_mean = movie_review_data_new[['movie title', 'rating']].groupby('movie title').mean()
movie_review_data_new_mean_sorted = movie_review_data_new_mean.sort_values('rating', ascending=False)
print(movie_review_data_new_mean_sorted.head(25))

### 10. Examine the gender distribution across different genres and check the validity of the statements below - 10 marks

* Men watch more drama than women
* Women watch more Sci-Fi than men
* Men watch more Romance than women


1. There is no need to conduct statistical tests around this. Just compare the percentages and comment on the validity of the above statements.
2. You might want to use the .sum() and .div() functions here.
3. Use the number of ratings to validate the numbers. For example, if out of 4000 ratings given by women, 3000 are for drama, we will assume that 75% of the women watch drama.
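
The percentage comparison described in the hints can be sketched with `.sum()` and `.div()` on a toy table. The frame `reviews` and its values are made up. The real input would be the merged ratings, items and users frame.

```python
import pandas as pd

# Toy merged frame: one row per rating, with the rater's gender
# and the 0/1 genre flags of the rated movie (made-up values)
reviews = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'M', 'F', 'F'],
    'Drama': [1, 0, 1, 0, 1, 1],
    'Sci-Fi': [0, 1, 1, 1, 0, 1],
})
genres = ['Drama', 'Sci-Fi']

# Number of ratings per gender that went to each genre
ratings_per_genre = reviews.groupby('gender')[genres].sum()
# Total number of ratings given by each gender
total_ratings = reviews.groupby('gender').size()
# Row-wise division turns the counts into percentages per gender
percent = ratings_per_genre.div(total_ratings, axis=0) * 100
print(percent)
```

Each statement can then be checked by comparing two cells of `percent`, for example the `M` and `F` entries in the `Drama` column.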

#### Conclusion:



In [None]:
# Verify Men watch more drama than women
total_men = movie_review_data[movie_review_data['gender'] == 'M'].shape[0]
total_women = movie_review_data[movie_review_data['gender'] == 'F'].shape[0]
no_of_men_watching_drama = \
movie_review_data[(movie_review_data['gender'] == 'M') & (movie_review_data['Drama'] == 1)].shape[0]
no_of_women_watching_drama = \
movie_review_data[(movie_review_data['gender'] == 'F') & (movie_review_data['Drama'] == 1)].shape[0]
percent_of_men_watching_drama = no_of_men_watching_drama * 100 / total_men
percent_of_women_watching_drama = no_of_women_watching_drama * 100 / total_women
print('Men watch more drama than women: {}'.format((percent_of_men_watching_drama > percent_of_women_watching_drama)))
print('Percent of Men watching Drama: {} \nPercent of Women watching Drama: {} '.format(percent_of_men_watching_drama,
                                                                                        percent_of_women_watching_drama))

In [None]:
# Verify Women watch more Sci-Fi than men
no_of_men_watching_scifi = \
movie_review_data[(movie_review_data['gender'] == 'M') & (movie_review_data['Sci-Fi'] == 1)].shape[0]
no_of_women_watching_scifi = \
movie_review_data[(movie_review_data['gender'] == 'F') & (movie_review_data['Sci-Fi'] == 1)].shape[0]
percent_of_men_watching_scifi = no_of_men_watching_scifi * 100 / total_men
percent_of_women_watching_scifi = no_of_women_watching_scifi * 100 / total_women
print('Women watch more Sci-Fi than Men: {}'.format((percent_of_men_watching_scifi < percent_of_women_watching_scifi)))
print('Percent of Men watching Sci-Fi: {} \nPercent of Women watching Sci-Fi: {} '.format(percent_of_men_watching_scifi,
                                                                                          percent_of_women_watching_scifi))

In [None]:
# Verify Men watch more Romance than women
no_of_men_watching_romance = \
movie_review_data[(movie_review_data['gender'] == 'M') & (movie_review_data['Romance'] == 1)].shape[0]
no_of_women_watching_romance = \
movie_review_data[(movie_review_data['gender'] == 'F') & (movie_review_data['Romance'] == 1)].shape[0]
percent_of_men_watching_romance = no_of_men_watching_romance * 100 / total_men
percent_of_women_watching_romance = no_of_women_watching_romance * 100 / total_women
print('Men watch more Romance than women: {}'.format(
    (percent_of_men_watching_romance > percent_of_women_watching_romance)))
print('Percent of Men watching Romance: {} \nPercent of Women watching Romance: {} '.format(
    percent_of_men_watching_romance, percent_of_women_watching_romance))
